LLaMA: Open and Efficient Foundation Language Models
2 Approach

Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

2.1 Pre-training Data

Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available, and compatible with open sourcing. This leads to the following mixture of data and the percentage they represent in the training set:

Table 1: Pre-training data. Data mixtures used for pre-training: for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.

Dataset          Sampling prop.   Epochs   Disk size
CommonCrawl      67.0%            1.10     3.3 TB
C4               15.0%            1.06     783 GB
Github           4.5%             0.64     328 GB
Wikipedia        4.5%             2.45     83 GB
Books            4.5%             2.23     85 GB
ArXiv            2.5%             1.06     92 GB
StackExchange    2.0%             1.03     78 GB

English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia vs. randomly sampled pages, and discarded pages not classified as references.

C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.

Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicate the resulting dataset at the file level, with exact matches.

Wikipedia [4.5%]. We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

ArXiv [2.5%]. We process arXiv LaTeX files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).
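To make the CCNet-style language filtering described for CommonCrawl more concrete, the sketch below shows how a fastText-based English filter could look. This is a minimal illustration, not the pipeline used in the paper: `lid.176.bin` is the publicly released fastText language-identification model, and the 0.5 confidence threshold is an assumption.

```python
import fasttext  # pip install fasttext

# Public fastText language-ID model; the threshold below is an illustrative assumption.
lid_model = fasttext.load_model("lid.176.bin")

def is_english(page_text: str, threshold: float = 0.5) -> bool:
    """Return True if the page is classified as English with enough confidence."""
    # fastText's predict() expects a single line of text.
    labels, probs = lid_model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

pages = [
    "This is an English web page about language models.",
    "Ceci est une page web en français.",
]
english_pages = [p for p in pages if is_english(p)]
```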
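Similarly, the GitHub filtering described above combines simple file-level quality heuristics (line length, proportion of alphanumeric characters), regular-expression boilerplate removal, and exact-match deduplication. The sketch below is one possible reading of that description; the thresholds and the boilerplate pattern are assumptions, since the paper does not report the exact values used.

```python
import hashlib
import re

# Illustrative thresholds; the paper does not report the exact values it used.
MAX_AVG_LINE_LEN = 100   # drop files whose average line length is very long
MAX_LINE_LEN = 1000      # drop files containing extremely long single lines
MIN_ALNUM_FRAC = 0.25    # drop files with a low proportion of alphanumeric characters

# Hypothetical boilerplate pattern, e.g. license headers at the top of a file.
BOILERPLATE_RE = re.compile(
    r"(?m)^\s*(//|#|\*)\s*(copyright|license|all rights reserved).*$",
    re.IGNORECASE,
)

def is_low_quality(text: str) -> bool:
    """Heuristic quality filter based on line length and alphanumeric proportion."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_frac = sum(c.isalnum() for c in text) / max(len(text), 1)
    return avg_len > MAX_AVG_LINE_LEN or max_len > MAX_LINE_LEN or alnum_frac < MIN_ALNUM_FRAC

def clean_and_dedup(files):
    """Remove boilerplate with regular expressions, then deduplicate exact matches."""
    seen = set()
    for text in files:
        if is_low_quality(text):
            continue
        text = BOILERPLATE_RE.sub("", text)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact-match, file-level deduplication
            continue
        seen.add(digest)
        yield text
```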
Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.
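As a rough illustration of this tokenization setup, the snippet below trains a SentencePiece BPE model with digit splitting and byte fallback enabled. The corpus file, vocabulary size and model prefix are placeholder assumptions, not the settings used for LLaMA.

```python
import sentencepiece as spm

# Train a BPE tokenizer with SentencePiece. "corpus.txt", the vocabulary size and
# the model prefix are placeholders, not the paper's actual configuration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe_tokenizer",
    model_type="bpe",
    vocab_size=32000,
    split_digits=True,    # split all numbers into individual digits
    byte_fallback=True,   # fall back to bytes for unknown UTF-8 characters
)

# Load the trained model and tokenize a sample string.
sp = spm.SentencePieceProcessor(model_file="bpe_tokenizer.model")
print(sp.encode("LLaMA was trained on 1.4T tokens.", out_type=str))
```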