SwissArmyTransformer: The Swiss Army Knife Toolbox User Manual

2024-07-16 07:01 | Source: compiled from the web

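Introduction

sat (SwissArmyTransformer) is a flexible and powerful library for developing your own Transformer variants.

sat is named after the Swiss Army knife: all models (e.g. BERT, GPT, T5, GLM, CogView, ViT...) share the same backbone code and serve versatile purposes through a few extra lightweight mixins.

sat is powered by DeepSpeed ZeRO and model parallelism, and aims to provide best practice for pretraining and finetuning large models (100M~20B parameters).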

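Migrating from SwissArmyTransformer 0.2.x to 0.3.x

1. Change the package name from SwissArmyTransformer to sat when importing, e.g. from sat import get_args.
2. Delete every --sandwich-ln in your scripts and use layernorm-order='sandwich' instead.
3. Swap the argument order: from_pretrained(args, name) => from_pretrained(name, args).
4. A model can now be loaded directly in model-only mode, without initializing sat first: from sat.model import AutoModel; model, args = AutoModel.from_pretrained('roberta-base').

Install

pip install SwissArmyTransformer

Features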

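Add model-agnostic components, e.g. prefix-tuning, in just one line!

Prefix-tuning (or P-tuning) improves finetuning by adding trainable parameters to each attention layer. With our library it is easy to apply to a GLM classification model (or any other model).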

# Import paths below are assumed from the sat package layout.
from sat.model import GLMModel
from sat.model.mixins import MLPHeadMixin, PrefixTuningMixin

class ClassificationModel(GLMModel):  # can also be BertModel, RobertaModel, etc.
    def __init__(self, args, transformer=None, **kwargs):
        super().__init__(args, transformer=transformer, **kwargs)
        self.add_mixin('classification_head', MLPHeadMixin(args.hidden_size, 2048, 1))
        # Arm an arbitrary model with Prefix-tuning with this line!
        self.add_mixin('prefix-tuning', PrefixTuningMixin(
            args.num_layers,
            args.hidden_size // args.num_attention_heads,
            args.num_attention_heads,
            args.prefix_len))
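To make the idea of prefix-tuning concrete, here is a minimal pure-Python sketch of single-head attention with a prefix. It is not sat's PrefixTuningMixin: the function names (attend, attend_with_prefix) and the toy vectors are assumptions chosen for illustration. The prefix keys and values, which would be trainable parameters in practice, are simply prepended to the sequence's keys and values so that every query can also attend to them.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Prefix-tuning idea: prepend extra keys/values before attending."""
    return attend(query, prefix_keys + keys, prefix_values + values)

# Toy data (hypothetical): 2 sequence positions, hidden size 2, prefix length 1.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
prefix_keys = [[0.5, 0.5]]    # would be trainable parameters in practice
prefix_values = [[2.0, 2.0]]  # would be trainable parameters in practice

query = [1.0, 0.0]
plain_out, plain_w = attend(query, keys, values)
prefix_out, prefix_w = attend_with_prefix(query, keys, values, prefix_keys, prefix_values)
```

The attention weights now cover three slots instead of two, so the output shifts toward the prefix values even though the input sequence is unchanged. That is why only the prefix parameters need training.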

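GPT and other autoregressive models behave differently during training and inference. During inference, text is generated token by token, and previous states must be cached for efficiency. With our library you only need to think about the training-time behavior (teacher forcing), then turn the model into a cached autoregressive model by adding a mixin: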

from sat.model import AutoModel
from sat.model.mixins import CachedAutoregressiveMixin
from sat.generation.autoregressive_sampling import filling_sequence
from sat.generation.sampling_strategies import BeamSearchStrategy

# args comes from get_args(); input_seq is the tokenized prompt prepared by the caller.
model, args = AutoModel.from_pretrained('glm-10b-chinese', args)
model.add_mixin('auto-regressive', CachedAutoregressiveMixin())

# Generate a sequence with beam search
output, *mems = filling_sequence(model, input_seq,
                                 batch_size=args.batch_size,
                                 strategy=BeamSearchStrategy(args.batch_size))
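The difference between the training-time pass and the cached inference-time pass can be sketched with a toy model in plain Python. This is only an illustration of the caching idea, not how CachedAutoregressiveMixin is implemented: embed, causal_mean and CachedDecoder are hypothetical names, and a running average stands in for causal attention.

```python
def embed(token):
    """Toy embedding: map a token id to a 2-d vector (hypothetical)."""
    return [float(token), float(token % 3)]

def causal_mean(vectors):
    """Stand-in for causal attention: average of all vectors seen so far."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def forward_full(tokens):
    """Training-style pass: every position is computed from the whole prefix."""
    states = [embed(t) for t in tokens]
    return [causal_mean(states[: i + 1]) for i in range(len(states))]

class CachedDecoder:
    """Inference-style pass: keep past states so each step handles one token."""

    def __init__(self):
        self.cache = []  # plays the role of the cached memories

    def step(self, token):
        self.cache.append(embed(token))
        return causal_mean(self.cache)

tokens = [5, 1, 4, 2]
full_outputs = forward_full(tokens)

decoder = CachedDecoder()
cached_outputs = [decoder.step(t) for t in tokens]
```

Both paths produce the same per-position outputs. The cached path only embeds the newest token at each step, which is the saving the mixin provides during generation.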

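Build your Transformer-based model with minimal code. We mentioned GLM above, which differs from the standard Transformer (called BaseModel) only in its position embedding (and training loss). When coding, we only need to focus on those parts.

The whole definition, expanded: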

import torch
from sat.model import BaseMixin, BaseModel

class BlockPositionEmbeddingMixin(BaseMixin):
    # Here define parameters for the mixin
    def __init__(self, max_sequence_length, hidden_size, init_method_std=0.02):
        super(BlockPositionEmbeddingMixin, self).__init__()
        self.max_sequence_length = max_sequence_length
        self.hidden_size = hidden_size
        self.block_position_embeddings = torch.nn.Embedding(max_sequence_length, hidden_size)
        torch.nn.init.normal_(self.block_position_embeddings.weight, mean=0.0, std=init_method_std)

    # Here define the method for the mixin
    def position_embedding_forward(self, position_ids, **kwargs):
        # position_ids carries two rows per sample: positions and block positions.
        position_ids, block_position_ids = position_ids[:, 0], position_ids[:, 1]
        position_embeddings = self.transformer.position_embeddings(position_ids)
        block_position_embeddings = self.block_position_embeddings(block_position_ids)
        return position_embeddings + block_position_embeddings

class GLMModel(BaseModel):
    def __init__(self, args, transformer=None, parallel_output=True):
        super().__init__(args, transformer=transformer, parallel_output=parallel_output)
        # Add the mixin for GLM
        self.add_mixin('block_position_embedding',
                       BlockPositionEmbeddingMixin(args.max_sequence_length, args.hidden_size))
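The indexing in position_embedding_forward implies that position_ids has shape [batch, 2, seq_len]: row 0 holds ordinary positions and row 1 holds block positions. The sketch below reproduces that lookup-and-sum with plain Python lists. The tables are deterministic toy values rather than learned weights, the id values are illustrative only, and make_table, lookup and glm_position_embedding are hypothetical helper names.

```python
def make_table(num_rows, hidden_size, offset):
    """Deterministic toy embedding table (real tables are learned)."""
    return [[offset + row + 0.1 * col for col in range(hidden_size)]
            for row in range(num_rows)]

def lookup(table, ids):
    """Embedding lookup for one sequence of integer ids."""
    return [table[i] for i in ids]

def glm_position_embedding(position_ids, pos_table, block_table):
    """Sum position and block-position embeddings for every token.

    position_ids has shape [batch, 2, seq_len]: row 0 holds positions,
    row 1 holds block positions.
    """
    batch_output = []
    for sample in position_ids:
        positions, block_positions = sample[0], sample[1]
        pos_emb = lookup(pos_table, positions)
        block_emb = lookup(block_table, block_positions)
        batch_output.append([[p + b for p, b in zip(pv, bv)]
                             for pv, bv in zip(pos_emb, block_emb)])
    return batch_output

hidden_size = 2
pos_table = make_table(8, hidden_size, offset=0.0)
block_table = make_table(8, hidden_size, offset=100.0)

# One sample: 3 context tokens followed by a 2-token span (toy ids).
position_ids = [[[0, 1, 2, 1, 1],
                 [0, 0, 0, 1, 2]]]
embeddings = glm_position_embedding(position_ids, pos_table, block_table)
```

The last two tokens share the same position id but receive different embeddings because their block positions differ, which is the extra signal the mixin adds on top of the standard position embedding.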

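Comprehensive training support. sat aims to provide best practice for pretraining and finetuning: you only need to implement forward_step and create_dataset_function, and can change useful training configurations through hyperparameters.

- Extend training to multiple GPUs or nodes by specifying --num_nodes, --num_gpus and a simple hostfile.
- DeepSpeed and model parallelism.
- Better integration of ZeRO-2 and activation checkpointing.
- Automatic extending and shuffling of training data, plus memmap.
- Successfully supports the training of CogView2 and CogVideo.
- Currently the only open-source codebase that supports finetuning T5-10B on GPUs.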

Quick Tour

The most typical Python file for using BERT in sat (for inference) looks like this:

# @File: inference_bert.py
from sat import get_args, get_tokenizer, AutoModel

# Parse args, initialize the environment. This is necessary.
args = get_args()
# Automatically download and load model. Will also dump model-related hyperparameters to args.
model, args = AutoModel.from_pretrained('bert-base-uncased', args)
# Get the BertTokenizer according to args.tokenizer_type (automatically set).
tokenizer = get_tokenizer(args)
# Here to use bert as you want!
# ...

We can then run the code with:

SAT_HOME=/path/to/download python inference_bert.py --mode inference

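All officially supported model names are listed in urls.py.

Finetuning or pretraining a transformer follows the same pattern: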

# @File: finetune_bert.py
import torch
from sat import get_args, get_tokenizer, AutoModel, training_main
from sat.model.mixins import MLPHeadMixin

def create_dataset_function(path, args):
    # Here to load the dataset
    # ...
    assert isinstance(dataset, torch.utils.data.Dataset)
    return dataset

def forward_step(data_iterator, model, args, timers):
    inputs = next(data_iterator)  # from the dataset of create_dataset_function.
    loss, *others = model(inputs)
    return loss

# Parse args, initialize the environment. This is necessary.
args = get_args()
model, args = AutoModel.from_pretrained('bert-base-uncased', args)
tokenizer = get_tokenizer(args)
# Here to use bert as you want!
model.del_mixin('bert-final')
model.add_mixin('classification_head', MLPHeadMixin(args.hidden_size, 2048, 1))
# ONE LINE to train!
# args already includes hyperparams such as lr, train-iters, zero-stage ...
training_main(args,
    model_cls=model,
    forward_step_function=forward_step,  # user-defined
    create_dataset_function=create_dataset_function  # user-defined
)

We can then run the code with:

deepspeed --include localhost:0,1 finetune_bert.py \
    --experiment-name ftbert \
    --mode finetune --train-iters 1000 --save /path/to/save \
    --train-data /path/to/train --valid-data /path/to/valid \
    --lr 0.00002 --batch-size 8 --zero-stage 1 --fp16

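Here we use data parallelism on GPUs 0 and 1. We can also launch training across many interconnected machines via --hostfile /path/to/hostfile. See the tutorial for more details.

To write your own model, you only need to consider how it differs from the standard Transformer. For example, if you have an idea for improving the attention operation: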

import torch
from sat.model import BaseMixin

class MyAttention(BaseMixin):
    def __init__(self, hidden_size):
        super(MyAttention, self).__init__()
        # MyAttention may need some new params, e.g. a learnable alpha.
        self.learnable_alpha = torch.nn.Parameter(torch.ones(hidden_size))

    # This is a hook function, the name `attention_fn` is special.
    def attention_fn(self, q, k, v, mask, dropout=None, **kwargs):
        # Code for my attention.
        # ...
        return attention_results

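Here attention_fn is a hook function: it replaces the default behavior with the new function. All available hooks are listed in transformer_defaults.py. We can now use add_mixin to apply the change to all transformers, such as BERT, ViT and CogView. See the tutorial for more details.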
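How a method named after a hook can override a default is easiest to see in a toy dispatcher. The sketch below is an assumption-laden illustration, not sat's implementation: ToyTransformer, ToyMixin, ScaledAttention and DEFAULT_HOOKS are hypothetical names, and the "attention" here just scales its values.

```python
def default_attention_fn(q, k, v):
    """Toy default: return the values unchanged."""
    return list(v)

# Hypothetical registry of default hooks, keyed by hook name.
DEFAULT_HOOKS = {"attention_fn": default_attention_fn}

class ToyMixin:
    """Base class for toy mixins; subclasses define methods named after hooks."""

class ScaledAttention(ToyMixin):
    def __init__(self, alpha):
        self.alpha = alpha

    # Same name as a default hook, so it replaces the default behavior.
    def attention_fn(self, q, k, v):
        return [self.alpha * x for x in v]

class ToyTransformer:
    def __init__(self):
        self.mixins = {}
        self.hooks = {}

    def add_mixin(self, name, mixin):
        if name in self.mixins:
            raise ValueError(f"mixin {name!r} already registered")
        self.mixins[name] = mixin
        self._collect_hooks()

    def del_mixin(self, name):
        del self.mixins[name]
        self._collect_hooks()

    def _collect_hooks(self):
        """Rebuild the hook table from every registered mixin."""
        self.hooks = {}
        for mixin in self.mixins.values():
            for hook_name in DEFAULT_HOOKS:
                if hasattr(mixin, hook_name):
                    self.hooks[hook_name] = getattr(mixin, hook_name)

    def attention(self, q, k, v):
        fn = self.hooks.get("attention_fn", DEFAULT_HOOKS["attention_fn"])
        return fn(q, k, v)

model = ToyTransformer()
before = model.attention([1.0], [1.0], [2.0, 4.0])
model.add_mixin("my-attention", ScaledAttention(alpha=0.5))
after = model.attention([1.0], [1.0], [2.0, 4.0])
model.del_mixin("my-attention")
restored = model.attention([1.0], [1.0], [2.0, 4.0])
```

Because the backbone only ever calls the hook table, the same mixin object can be attached to any model built on that backbone, and removing it restores the default.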

Tutorials

How to use pretrained models collected in sat?
Why and how to train models in sat?

Citation

Currently we don't have a paper, so you don't need to formally cite us!~

If this project helps your research or engineering, use \footnote{https://github.com/THUDM/SwissArmyTransformer} to mention us and recommend SwissArmyTransformer to others.

The tutorial for contributing to sat is on the way!

The project is based on (a user of) DeepSpeed, Megatron-LM and Huggingface transformers. Thanks for their awesome work.

Training Guide

The Training API

We provide a simple but powerful training API, training_main(), which is not limited to our Transformer models and works with any torch.nn.Module.

from sat import get_args, training_main
from sat.model import AutoModel, BaseModel

args = get_args()

# Option 1: to pretrain from scratch, pass a class object.
model = BaseModel
# Option 2: to finetune from a given model, pass a torch.nn.Module instance.
model, args = AutoModel.from_pretrained('bert-base-uncased', args)

# forward_step and dataset_func are user-defined; this example leaves them out.
training_main(args,
    model_cls=model,
    forward_step_function=forward_step,
    create_dataset_function=dataset_func,
    handle_metrics_function=None,
    init_function=None
)

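The above is an (incomplete) example of a standard training program using sat. Besides args, training_main accepts 5 parameters:

- (Required) model_cls: a type object inheriting from torch.nn.Module, or a torch.nn.Module object to train from.
- (Required) forward_step_function: a user-defined function that takes data_iterator, model, args, timers and returns loss, {'metric0': m0, ...}.
- (Required) create_dataset_function: returns a torch.utils.data.Dataset for loading. Our library automatically distributes the data to multiple workers and hands the data iterator to forward_step_function.
- (Optional) handle_metrics_function: handles special metrics during evaluation.
- (Optional) init_function: a hook to change the model before training, useful for continuing training.

For a full example, see the Finetune BERT example.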
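The contract between the three required parameters can be sketched with a toy driver in plain Python. Everything here is hypothetical and only mirrors the shape of the API described above: toy_training_main is not sat's training_main, LinearModel stands in for a torch.nn.Module, args is a plain dict, and the weight update lives inside forward_step only because this toy has no autograd (a real training framework would run the backward pass and optimizer step outside it).

```python
import random

def create_dataset_function(path, args):
    """Return an indexable dataset; here synthetic (x, y) pairs with y = 3x."""
    return [(float(x), 3.0 * x) for x in range(1, 9)]

class LinearModel:
    """Stand-in for a torch.nn.Module: predicts y = weight * x."""

    def __init__(self):
        self.weight = 0.0

def forward_step(data_iterator, model, args, timers):
    """Return the loss and a dict of metrics."""
    x, y = next(data_iterator)
    error = model.weight * x - y
    loss = error ** 2
    # Manual gradient step, only because this toy has no autograd engine.
    model.weight -= args["lr"] * 2.0 * error * x
    return loss, {"abs_error": abs(error)}

def toy_training_main(args, model_cls, forward_step_function, create_dataset_function):
    """Hypothetical driver: build the data, build or reuse the model, loop."""
    # Accept either a class (train from scratch) or an instance (finetune).
    model = model_cls() if isinstance(model_cls, type) else model_cls
    dataset = create_dataset_function(args["train_data"], args)
    rng = random.Random(args["seed"])

    def endless_samples():
        while True:
            yield rng.choice(dataset)

    data_iterator = endless_samples()
    history = []
    for _ in range(args["train_iters"]):
        loss, metrics = forward_step_function(data_iterator, model, args, timers=None)
        history.append((loss, metrics))
    return model, history

args = {"lr": 0.005, "train_iters": 200, "seed": 0, "train_data": "unused"}
model, history = toy_training_main(args, LinearModel, forward_step, create_dataset_function)
```

The driver owns the data iterator and the loop, while the user code only describes one step and one dataset, which is the division of labor the parameter list describes.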
