
[Paper Translation] LLaMA: Open and Efficient Foundation Language Models


We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens and show that it is possible to train state-of-the-art models using only publicly available datasets, without resorting to proprietary and inaccessible ones. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

1. Introduction

Large Language Models (LLMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual instructions or from a few examples (Brown et al., 2020). These few-shot properties first appeared when scaling models to a sufficient size (Kaplan et al., 2020), resulting in a line of work that focuses on further scaling these models (Chowdhery et al., 2022; Rae et al., 2021). These efforts are based on the assumption that more parameters will lead to better performance. However, recent work from Hoffmann et al. (2022) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.

Large Language Models (LLMs) trained on massive corpora of text have shown the ability to perform new tasks from textual instructions or from a few examples. These few-shot properties first appeared when models were scaled to a sufficient size, which led to a line of work focused on scaling these models further. These efforts rest on the assumption that more parameters lead to better performance. However, recent work by Hoffmann et al. shows that, for a given compute budget, the best performance is achieved not by the largest models but by smaller models trained on more data.

The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

The objective of the scaling laws of Hoffmann et al. is to determine how best to scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the one that is fastest to train but the one that is fastest at inference; although it may be cheaper to train a large model to reach a certain level of performance, a smaller model trained for longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. recommend training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
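As a rough illustration of this trade-off (not from the paper), the sketch below uses the common approximations of about 6ND FLOPs for training and about 2N FLOPs per generated token for inference, where N is the parameter count and D the number of training tokens. The 10B/200B and 7B/1T configurations come from the text above; everything else is an assumption.

```python
# Back-of-the-envelope comparison of training vs. inference cost,
# using the standard ~6*N*D training-FLOPs and ~2*N inference-FLOPs-per-token
# approximations (an illustration, not the paper's own accounting).

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def infer_flops_per_token(n_params: float) -> float:
    return 2.0 * n_params

big = train_flops(10e9, 200e9)   # Chinchilla-style choice: 10B model, 200B tokens
small = train_flops(7e9, 1e12)   # LLaMA-style choice: 7B model, 1T tokens

print(f"10B/200B training FLOPs: {big:.2e}")
print(f"7B/1T   training FLOPs: {small:.2e}")

# The smaller model costs more to train here, but every served token is
# roughly 30% cheaper, which dominates once enough tokens are generated.
print(f"inference FLOPs/token: 10B={infer_flops_per_token(10e9):.1e}, "
      f"7B={infer_flops_per_token(7e9):.1e}")
```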

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. The resulting models, called LLaMA, range from 7B to 65B parameters with competitive performance compared to the best existing LLMs. For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU. At the higher end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B.

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than is typically used. The resulting models, called LLaMA, range from 7B to 65B parameters and are competitive with the best existing LLMs. For instance, LLaMA-13B outperforms GPT-3 on most benchmarks despite being 10× smaller. We believe this model will help democratize access to and the study of LLMs, since it can be run on a single GPU. At the higher end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B.

Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”). There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.

Unlike Chinchilla, PaLM, or GPT-3, we use only publicly available data, making our work compatible with open-sourcing, whereas most existing models rely on data that is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”). Some exceptions exist, notably OPT, GPT-NeoX, BLOOM, and GLM, but none of them are competitive with PaLM-62B or Chinchilla.

In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method. We then report the performance of our models and compare with other LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

In the rest of this paper, we present an overview of the modifications we made to the transformer architecture, as well as our training method. We then report the performance of our models and compare them with other LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

2. Approach

Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022) and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

2.1 Pre-training Data

Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available, and compatible with open sourcing. This leads to the following mixture of data and the percentage they represent in the training set:

Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, restricted to data that is publicly available and compatible with open sourcing. This leads to the following mixture of data and the percentages they represent in the training set:

English CommonCrawl [67%]

We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al.,2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references.

We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline. This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages, and filters low-quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia versus randomly sampled pages, and discarded pages not classified as references.
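A minimal sketch of the kind of line-level deduplication and fastText language identification described above; it is not the CCNet pipeline itself, and the model file `lid.176.bin` and the 0.5 confidence threshold are assumptions for illustration.

```python
import hashlib
import fasttext  # pip install fasttext; assumes the public lid.176.bin model is available

# Hypothetical language-ID model path; the real CCNet pipeline is more involved.
lang_id = fasttext.load_model("lid.176.bin")

def dedup_lines(lines, seen_hashes):
    """Drop lines whose hash has already been seen (line-level deduplication)."""
    kept = []
    for line in lines:
        h = hashlib.sha1(line.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            kept.append(line)
    return kept

def is_english(text, threshold=0.5):
    """Keep a page only if the classifier is confident it is English."""
    labels, probs = lang_id.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold
```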

C4 [15%]

During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.

During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We therefore included the publicly available C4 dataset in our data. The preprocessing of C4 also contains deduplication and language identification steps; the main difference from CCNet is the quality filtering, which mostly relies on heuristics such as the presence of punctuation marks or the number of words and sentences in a webpage.
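A sketch of the kind of heuristic quality filtering mentioned above (punctuation and length signals rather than a learned model). The exact thresholds are assumptions, not the C4 rules.

```python
import re

def passes_quality_heuristics(page: str,
                              min_words: int = 50,
                              min_sentences: int = 3) -> bool:
    """Heuristic quality filter in the spirit of C4 (thresholds illustrative)."""
    words = page.split()
    if len(words) < min_words:
        return False
    # Rough sentence count from terminal punctuation.
    sentences = re.split(r"[.!?]+\s", page)
    if len(sentences) < min_sentences:
        return False
    # Require that most non-empty lines end with a punctuation mark.
    lines = [l for l in page.splitlines() if l.strip()]
    punctuated = sum(l.rstrip().endswith((".", "!", "?", '"')) for l in lines)
    return bool(lines) and punctuated / len(lines) > 0.5
```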

Github [4.5%]

We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicate the resulting dataset at the file level, with exact matches.

We use the public GitHub dataset available on Google BigQuery. We only kept projects distributed under the Apache, BSD, and MIT licenses. Additionally, we filtered low-quality files with heuristics based on line length or the proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicated the resulting dataset at the file level, using exact matches.
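A sketch of the filtering steps described above. The license keys, length thresholds, alphanumeric ratio, and header regular expression are all assumptions used for illustration, not the paper's exact rules.

```python
import hashlib
import re

ALLOWED_LICENSES = {"apache-2.0", "bsd-2-clause", "bsd-3-clause", "mit"}  # assumed license keys

def keep_project(license_key: str) -> bool:
    """Keep only projects under Apache, BSD, or MIT licenses."""
    return license_key.lower() in ALLOWED_LICENSES

def keep_source_file(text: str) -> bool:
    """Heuristic filter on line length and alphanumeric ratio (thresholds illustrative)."""
    lines = text.splitlines()
    if not lines:
        return False
    if max(len(l) for l in lines) > 1000:
        return False
    alnum = sum(c.isalnum() for c in text)
    return alnum / max(len(text), 1) > 0.25

# Strip a leading license/header comment block with a regular expression
# (a simple stand-in for the boilerplate removal described in the text).
HEADER_RE = re.compile(r"\A(?:\s*(?:#|//).*\n)+")

def strip_header(text: str) -> str:
    return HEADER_RE.sub("", text)

def file_fingerprint(text: str) -> str:
    """Exact-match key for file-level deduplication."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()
```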

Wikipedia [4.5%]

We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

We add Wikipedia dumps from the June–August 2022 period, covering 20 languages that use either the Latin or Cyrillic script: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments, and other formatting boilerplate.

Gutenberg and Books3 [4.5%]

We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

We include two book corpora in our training dataset: the Gutenberg Project, which contains books in the public domain, and the Books3 section of ThePile, a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.
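The paper does not specify how content overlap is measured; as one possible reading, the sketch below uses Jaccard similarity over word 5-gram shingles, which is an assumption, together with a greedy book-level dedup at the 90% threshold mentioned above.

```python
def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles used to approximate content overlap."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def content_overlap(book_a: str, book_b: str) -> float:
    """Jaccard similarity between shingle sets (an assumed overlap measure)."""
    a, b = shingles(book_a), shingles(book_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate_books(books, threshold: float = 0.9):
    """Greedy book-level dedup: drop a book if it overlaps >90% with a kept one."""
    kept = []
    for book in books:
        if all(content_overlap(book, other) <= threshold for other in kept):
            kept.append(book)
    return kept
```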

ArXiv [2.5%]

We process arXiv LaTeX files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

We process arXiv LaTeX files to add scientific data to our dataset. Following Lewkowycz et al., we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]

We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).

We include a dump of Stack Exchange, a website of high-quality questions and answers covering a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from the text, and sorted the answers by score (from highest to lowest).
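Putting the sources together, a minimal sketch of sampling training documents according to the mixture percentages listed above. The sampling mechanism itself is an assumption; only the proportions come from the text (Table 1), and the extra epochs over Wikipedia and Books mentioned later are not modeled here.

```python
import random

# Sampling proportions from the text (Table 1).
MIXTURE = {
    "commoncrawl": 0.67,
    "c4": 0.15,
    "github": 0.045,
    "wikipedia": 0.045,
    "books": 0.045,
    "arxiv": 0.025,
    "stackexchange": 0.02,
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document comes from,
    proportionally to its share of the mixture."""
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```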

Tokenizer

We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters.

We tokenize the data with the byte-pair encoding (BPE) algorithm, using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits and fall back to bytes to decompose unknown UTF-8 characters.
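A sketch of how such a tokenizer could be trained with SentencePiece, using its digit-splitting and byte-fallback options. The corpus path, vocabulary size, and character coverage are assumptions; the paper does not give the trainer invocation.

```python
import sentencepiece as spm

# Train a BPE model with the two properties described above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical plain-text training corpus
    model_prefix="llama_bpe",
    model_type="bpe",
    vocab_size=32000,            # assumed vocabulary size
    split_digits=True,           # split all numbers into individual digits
    byte_fallback=True,          # decompose unknown UTF-8 characters into bytes
    character_coverage=0.99995,  # assumed
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("LLaMA was trained on 1400000000000 tokens", out_type=str))
```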

Overall, our entire training dataset contains roughly 1.4T tokens after tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.

Overall, our entire training dataset contains roughly 1.4T tokens after tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.

2.2 Architecture

Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017). We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main differences from the original architecture, and where we found the inspiration for each change (in brackets):

Following recent work on large language models, our network is based on the transformer architecture. We leverage various improvements that were subsequently proposed and used in different models such as PaLM. Here are the main differences from the original architecture, and where we found the inspiration for each change (in brackets):

Pre-normalization [GPT3].

To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019).

To improve the training stability, we normalize the input of each transformer sub-layer instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019).
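A minimal PyTorch sketch of RMSNorm used as pre-normalization; the epsilon value and the placeholder sub-layer are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm (Zhang and Sennrich, 2019): scale by the root mean square
    of the features instead of subtracting the mean as in LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

# Pre-normalization: normalize the input of the sub-layer, not its output.
# `sublayer` stands in for the attention or feed-forward module.
def pre_norm_block(x, sublayer, norm):
    return x + sublayer(norm(x))
```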

SwiGLU activation function [PaLM].

We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020), to improve the performance. We use a dimension of \frac{2}{3}4d instead of 4d as in PaLM.

We replace the ReLU non-linearity with the SwiGLU activation function, introduced by Shazeer (2020), to improve performance. We use a dimension of \frac{2}{3}4d instead of 4d as in PaLM.
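A minimal sketch of a SwiGLU feed-forward block with hidden dimension 2/3·4d. The lack of rounding of the hidden size (implementations often round it to a hardware-friendly multiple) and the bias-free linear layers are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with the SwiGLU activation (Shazeer, 2020):
    FFN(x) = W2( silu(W1 x) * W3 x ), with hidden size 2/3 * 4d."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * 4 * dim / 3)                  # 2/3 * 4d (no rounding; assumed)
        self.w1 = nn.Linear(dim, hidden, bias=False)   # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)   # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFeedForward(dim=512)
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```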

Rotary Embeddings [GPTNeo].

We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network.

We remove the absolute positional embeddings and instead add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network.
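A sketch of rotary positional embeddings applied to query and key tensors. It uses the common "rotate-half" formulation with base 10000; the tensor layout and interface are assumptions for illustration, not the paper's implementation.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to a tensor of shape (batch, seq_len, n_heads, head_dim):
    rotate pairs of channels by a position-dependent angle instead of
    adding absolute position vectors."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]   # (1, seq, 1, half)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 4, 64)   # queries: (batch, seq, heads, head_dim)
k = torch.randn(1, 8, 4, 64)
q_rot, k_rot = rotary_embedding(q), rotary_embedding(k)
```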

The details of the hyper-parameters for our different models are given in Table 2.

The details of the hyper-parameters for our different models are given in Table 2.

2.3 Optimizer

Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95. We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. We use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).

Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95. We use a cosine learning rate schedule such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. We use 2,000 warmup steps and vary the learning rate and batch size with the size of the model (see Table 2 for details).
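A sketch of these optimizer settings in PyTorch. The peak learning rate, total step count, and the stand-in model are placeholders; the per-model values live in Table 2.

```python
import math
import torch

model = torch.nn.Linear(512, 512)                            # stand-in for the transformer
max_lr, total_steps, warmup_steps = 3e-4, 100_000, 2_000     # peak LR and steps assumed

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_factor(step: int) -> float:
    """Linear warmup, then cosine decay down to 10% of the peak learning rate."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

# Inside the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping of 1.0
optimizer.step()
scheduler.step()
```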

2.4 Efficient implementation

We make several optimizations to improve the training speed of our models. First, we use an efficient implementation of the causal multi-head attention operator, inspired by Rabe and Staats (2021) and Dao et al. (2022). This implementation, available in the xformers library, reduces the memory usage and computation. This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.

We make several optimizations to improve the training speed of our models. First, we use an efficient implementation of the causal multi-head attention operator, inspired by Rabe and Staats (2021) and Dao et al. (2022). This implementation, available in the xformers library, reduces memory usage and computation. This is achieved by not storing the attention weights and by not computing the key/query scores that are masked due to the causal nature of the language modeling task.
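A hedged usage sketch, assuming an xformers version that exposes `memory_efficient_attention` and `LowerTriangularMask` and a CUDA device; the shapes are illustrative. It shows how the causal mask is passed so the full attention matrix never has to be materialized.

```python
import torch
import xformers.ops as xops  # assumes xformers is installed

# Shapes: (batch, seq_len, n_heads, head_dim), as expected by xformers.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Memory-efficient attention with a causal (lower-triangular) bias:
# the attention weight matrix is never stored, and fully masked scores
# are skipped, which is what reduces memory usage and computation.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```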

To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing. More precisely, we save the activations that are expensive to compute, such as the outputs of linear layers. This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. To fully benefit from this optimization, we need to reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022). Moreover, we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible.

To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing. More precisely, we save the activations that are expensive to compute, such as the outputs of linear layers. This is achieved by manually implementing the backward function for the transformer layers, instead of relying on PyTorch autograd. To fully benefit from this optimization, we need to reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022). Moreover, we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible.
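The paper writes a manual backward that keeps the expensive linear-layer outputs; as an approximation only, the sketch below uses PyTorch's built-in `torch.utils.checkpoint`, which instead recomputes the whole wrapped block during the backward pass. The stand-in layer and shapes are assumptions.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Stand-in for a transformer layer."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)

block = Block()
x = torch.randn(4, 128, 512, requires_grad=True)

# Wrapping the layer in checkpoint() frees its intermediate activations after
# the forward pass and recomputes them during backward, trading compute for
# memory. (The paper's custom backward is more selective: it keeps the
# expensive linear-layer outputs and recomputes only the cheap parts.)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```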

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.
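A quick arithmetic check of the 21-day figure from the reported throughput:

```python
# Sanity check of the reported training time from the stated throughput.
tokens = 1.4e12                  # training set size in tokens
throughput = 380 * 2048          # tokens/sec/GPU * number of GPUs
seconds = tokens / throughput
print(f"{seconds / 86400:.1f} days")   # ~20.8 days, matching the ~21 days above
```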


