LLaMA: Open and Efficient Foundation Language Models
2 Approach

Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

2.1 Pre-training Data

Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available, and compatible with open sourcing. This leads to the following mixture of data and the percentage they represent in the training set:

Table 1: Pre-training data. Data mixtures used for pre-training: for each subset we list the sampling proportion, number of epochs performed on the subset when training on 1.4T tokens, and disk size. The pre-training runs on 1T tokens have the same sampling proportion.

Dataset          Sampling prop.   Epochs   Disk size
CommonCrawl      67.0%            1.10     3.3 TB
C4               15.0%            1.06     783 GB
Github           4.5%             0.64     328 GB
Wikipedia        4.5%             2.45     83 GB
Books            4.5%             2.23     85 GB
ArXiv            2.5%             1.06     92 GB
StackExchange    2.0%             1.03     78 GB

English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia vs. randomly sampled pages, and discarded pages not classified as references.

C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.

Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicate the resulting dataset at the file level, with exact matches.

Wikipedia [4.5%]. We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

ArXiv [2.5%]. We process arXiv LaTeX files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).
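To make the CCNet-style language filtering described for CommonCrawl more concrete, the sketch below shows how a fastText-based English filter could look. This is a minimal illustration, not the pipeline used in the paper: `lid.176.bin` is the publicly released fastText language-identification model, and the 0.5 confidence threshold is an assumption.

```python
import fasttext  # pip install fasttext

# Public fastText language-ID model; the threshold below is an illustrative assumption.
lid_model = fasttext.load_model("lid.176.bin")

def is_english(page_text: str, threshold: float = 0.5) -> bool:
    """Return True if the page is classified as English with enough confidence."""
    # fastText's predict() expects a single line of text.
    labels, probs = lid_model.predict(page_text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

pages = [
    "This is an English web page about language models.",
    "Ceci est une page web en français.",
]
english_pages = [p for p in pages if is_english(p)]
```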
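Similarly, the GitHub filtering described above combines simple file-level quality heuristics (line length, proportion of alphanumeric characters), regular-expression boilerplate removal, and exact-match deduplication. The sketch below is one possible reading of that description; the thresholds and the boilerplate pattern are assumptions, since the paper does not report the exact values used.

```python
import hashlib
import re

# Illustrative thresholds; the paper does not report the exact values it used.
MAX_AVG_LINE_LEN = 100   # drop files whose average line length is very long
MAX_LINE_LEN = 1000      # drop files containing extremely long single lines
MIN_ALNUM_FRAC = 0.25    # drop files with a low proportion of alphanumeric characters

# Hypothetical boilerplate pattern, e.g. license headers at the top of a file.
BOILERPLATE_RE = re.compile(
    r"(?m)^\s*(//|#|\*)\s*(copyright|license|all rights reserved).*$",
    re.IGNORECASE,
)

def is_low_quality(text: str) -> bool:
    """Heuristic quality filter based on line length and alphanumeric proportion."""
    lines = text.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_frac = sum(c.isalnum() for c in text) / max(len(text), 1)
    return avg_len > MAX_AVG_LINE_LEN or max_len > MAX_LINE_LEN or alnum_frac < MIN_ALNUM_FRAC

def clean_and_dedup(files):
    """Remove boilerplate with regular expressions, then deduplicate exact matches."""
    seen = set()
    for text in files:
        if is_low_quality(text):
            continue
        text = BOILERPLATE_RE.sub("", text)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact-match, file-level deduplication
            continue
        seen.add(digest)
        yield text
```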
Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.
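As a rough illustration of this tokenization setup, the snippet below trains a SentencePiece BPE model with digit splitting and byte fallback enabled. The corpus file, vocabulary size and model prefix are placeholder assumptions, not the settings used for LLaMA.

```python
import sentencepiece as spm

# Train a BPE tokenizer with SentencePiece. "corpus.txt", the vocabulary size and
# the model prefix are placeholders, not the paper's actual configuration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe_tokenizer",
    model_type="bpe",
    vocab_size=32000,
    split_digits=True,    # split all numbers into individual digits
    byte_fallback=True,   # fall back to bytes for unknown UTF-8 characters
)

# Load the trained model and tokenize a sample string.
sp = spm.SentencePieceProcessor(model_file="bpe_tokenizer.model")
print(sp.encode("LLaMA was trained on 1.4T tokens.", out_type=str))
```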