Understanding Vision Transformer Principles and Code: One Technical Survey Is All You Need (Part 14)

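Author: 科技猛兽

Editor: 极市平台

Editor's digest

This article introduces two papers, T2T-ViT and VOLO. Both are classic vision Transformer works from Jiashi Feng's group at the National University of Singapore and Shuicheng Yan's team at YITU Technology, and both have drawn wide attention in industry.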

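Contents

31 T2T-ViT: Training Vision Transformers from Scratch on ImageNet

(from Jiashi Feng's team at the National University of Singapore and Shuicheng Yan's team at YITU Technology)

31.1 How T2T-ViT works

31.2 T2T-ViT code walkthrough

32 VOLO: new records on several CV tasks, reaching 87.1% on ImageNet for the first time without extra training data

(from Jiashi Feng's team at the National University of Singapore and Shuicheng Yan's team at YITU Technology)

32.1 How VOLO works

32.2 VOLO code walkthrough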

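The Transformer is a classic NLP model proposed by a Google team in 2017; the popular BERT is also built on it. The Transformer relies on self-attention instead of the sequential structure of an RNN, so it can be trained in parallel and has access to global information.

The two papers covered here both come from Jiashi Feng's group at the National University of Singapore and Shuicheng Yan's team at YITU Technology.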

31 T2T-ViT: Training Vision Transformers from Scratch on ImageNet

Paper title: Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Paper URL:

https://arxiv.org/pdf/2101.11986.pdf

31.1 How T2T-ViT works:

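In the T2T-ViT paper, the authors argue that when trained on a mid-sized dataset such as ImageNet, current vision Transformers underperform even ordinary CNNs such as ResNet for two reasons:

1) The way ViT processes an image is too crude to model the image's local structure.

2) ViT's self-attention backbone is not as well designed as CNN backbones.

1) The way ViT processes an image is not good enough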

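ViT first turns an image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$. This can be viewed as a sequence of flattened 2D patches: there are $N$ patches in total, each of dimension $P^2 \cdot C$, where $P$ is the patch size and $C$ is the number of channels.

Note the purpose of this step: the Transformer expects a 2D matrix of shape $(N, D)$ as input, where $N$ is the sequence length and $D$ is the dimension of each vector in the sequence; common values are {Base: 768, Small: 384, Tiny: 192}.

So the goal is to convert the 3D image of shape $H \times W \times C$ into a 2D input of shape $(N, D)$.

This gives:

$$N = \frac{HW}{P^2}$$

where $N$ is the length of the sequence fed to the Transformer.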
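As a quick sanity check of the sequence length, here is a minimal sketch in plain Python. The function name `vit_sequence_shape` is made up for illustration; it only counts non-overlapping patches and the flattened patch length.

```python
def vit_sequence_shape(height, width, channels, patch_size):
    """Return (N, P*P*C): the number of patches and the flattened patch length."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim


# A 224x224 RGB image with 16x16 patches gives 14*14 = 196 tokens of length 768.
print(vit_sequence_shape(224, 224, 3, 16))  # (196, 768)
```

With the usual 224×224 input and 16×16 patches, the Transformer therefore sees 196 tokens before the class token is added.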

The code for this step (from the timm library; see https://zhuanlan.zhihu.com/p/350837279):

import torch.nn as nn
from timm.models.layers import to_2tuple  # helper used by timm's ViT code

class PatchEmbed(nn.Module):
    """ Image to Patch Embedding
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0])
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = num_patches

        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        B, C, H, W = x.shape
        # FIXME look at relaxing size constraints
        assert H == self.img_size[0] and W == self.img_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P) -> (B, N, embed_dim)
        x = self.proj(x).flatten(2).transpose(1, 2)
        return x

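That is what ViT does. The core of the patch-splitting step is this line:

self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

This is a 16×16 convolution with stride 16. It has 3 input channels and embed_dim = 768/384/192 output channels. We call it the patchify stem.

So ViT's image processing is really just a single convolution over non-overlapping patches. It cannot model the local structure of an image (for example the edges and corners of objects), which is why ViT needs a large amount of training data.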

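2) ViT's self-attention backbone is not as well designed as CNN backbones

The self-attention backbone was not originally designed for CV tasks, so it contains redundant features. This limits feature richness and makes the model hard to train.

To verify these two claims, the authors visualize the features of the first layer, a middle layer, and the last layer of ViT, ResNet50, and T2T-ViT, as shown in Figures 1 to 3 below.

Figure 1 shows the shallow-layer features of the three networks: ResNet50 learns low-level edges, corners, and textures well. Figure 2 shows the middle-layer features: ResNet50 learns relatively more complex global structure. ViT behaves differently: its shallow and middle-layer features are highly redundant, many channels look almost identical, and edges, corners, and textures are missing. The authors also find that many ViT channels are all zeros (highlighted in red in Figure 2). This means the ViT backbone is less efficient than ResNet50 and provides very limited feature richness when training samples are insufficient.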

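Figure 1: First-layer feature visualization for ViT, ResNet, and T2T-ViT

Figure 2: Middle-layer feature visualization for ViT, ResNet, and T2T-ViT

Figure 3: Last-layer feature visualization for ViT, ResNet, and T2T-ViT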

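T2T-ViT proposes two improvements to address these two problems. To model the local structure of an image, the authors design the Tokens-to-Token module, which models the relationship between each token and its neighboring tokens. To find a more efficient Transformer backbone, the authors borrow from CNN design and find that, at similar compute, a "deep and narrow" architecture performs better. Both improvements are described in detail below.

Tokens-to-Token module

Figure 4 below illustrates the Tokens-to-Token module. It consists of two parts: Restructurization and Soft Split (SS).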

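Step 1: Restructurization

The input to the Tokens-to-Token module is the sequence of image patches described above, $T \in \mathbb{R}^{l \times c}$. This sequence passes through a Transformer block, which can be a ViT block or something like a Performer block. Assuming it is a ViT block, we have:

$$T' = \mathrm{MLP}(\mathrm{MSA}(T))$$

where $\mathrm{MSA}$ denotes multi-head self-attention and $\mathrm{MLP}$ denotes a multi-layer perceptron.

Figure 4: The Tokens-to-Token module

Suppose the tensor $T'$ has shape $l \times c$. It is then reshaped:

$$I = \mathrm{Reshape}(T')$$

Concretely, the tensor $T'$ of shape $l \times c$ is reshaped into $I \in \mathbb{R}^{h \times w \times c}$, with $l = h \times w$.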

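Step 2: Soft Split (SS)

Once we have the new tensor $I$, the second step is the Soft Split. The idea is simple: in Step 1 the tensor was reshaped back into a 3D layout precisely so that local information is easier to extract, and this step is where the local information is modeled.

As the blue and red boxes in Figure 4 show, the new tensor is split into patches again, and these patches overlap. Each patch is therefore related to the patches around it, much like the inductive bias in a CNN. Each patch (the red or blue box, for example) is then unfolded into a new token.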

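Readers unfamiliar with nn.Unfold can refer to:

Unfold - PyTorch 1.9.0 documentation

https://pytorch.org/docs/stable/generated/torch.nn.Unfold.html

Or here is an informal explanation of nn.Unfold. Suppose we have a tensor of shape $(N, C, H, W)$. Unfold slides a $k \times k$ kernel over this tensor. Like a convolution, it takes one $C \times k \times k$ block at a time and flattens it into a vector of length $C \cdot k \cdot k$. What is the shape of the output tensor? The answer is $(N, C \cdot k \cdot k, L)$.

Here

$$L = \left\lfloor \frac{H + 2p - k}{s} + 1 \right\rfloor \times \left\lfloor \frac{W + 2p - k}{s} + 1 \right\rfloor$$

where $s$ and $p$ are the stride and padding arguments of nn.Unfold.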
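To make the sliding-window behavior concrete, here is a minimal pure-Python sketch of an unfold on one small image stored as nested lists. It is not PyTorch's implementation; the function name `unfold2d` is made up, dilation is fixed to 1, and the result is returned as a list of patches (the transpose of the channel-first layout described above).

```python
def unfold2d(x, k, s, p):
    """Unfold one C x H x W image (nested lists) into flattened k x k patches.

    Returns a list of L patches, each of length C*k*k, with zero padding p.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out_h = (H + 2 * p - k) // s + 1
    out_w = (W + 2 * p - k) // s + 1

    def at(c, i, j):
        # Positions outside the image read as zero (zero padding).
        return x[c][i][j] if 0 <= i < H and 0 <= j < W else 0

    patches = []
    for oi in range(out_h):
        for oj in range(out_w):
            top, left = oi * s - p, oj * s - p
            patches.append([at(c, top + di, left + dj)
                            for c in range(C)
                            for di in range(k)
                            for dj in range(k)])
    return patches


image = [[[r * 4 + c for c in range(4)] for r in range(4)]]  # 1 x 4 x 4, values 0..15
patches = unfold2d(image, k=3, s=1, p=0)
print(len(patches), len(patches[0]))  # 4 9
print(patches[0])  # [0, 1, 2, 4, 5, 6, 8, 9, 10]
print(patches[1])  # [1, 2, 3, 5, 6, 7, 9, 10, 11]
```

Neighboring patches share two of their three columns, which is exactly the overlap that Soft Split relies on.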

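With nn.Unfold understood: the T2T module applies nn.Unfold to $I$, so the output tensor has shape $T_o \in \mathbb{R}^{l_o \times c k^2}$,

where

$$l_o = \left\lfloor \frac{h + 2p - k}{k - s} + 1 \right\rfloor \times \left\lfloor \frac{w + 2p - k}{k - s} + 1 \right\rfloor$$

Here $s$ is the overlap between adjacent patches, so $k - s$ plays the role of the stride argument.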

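That is the whole T2T process. Written as formulas:

$$T'_i = \mathrm{MLP}(\mathrm{MSA}(T_i)), \quad I_i = \mathrm{Reshape}(T'_i), \quad T_{i+1} = \mathrm{SS}(I_i), \quad i = 1, \dots, (n-1)$$

Compared with a regular ViT block, a T2T step adds one Reshape and one Unfold. The Reshape restores the 3D layout so that local information can be extracted. The Unfold extracts that local information and, along the way, converts the tensor back to the 2D format that the Transformer block expects.

For an input image $I_0$, a Soft Split first turns it into tokens, $T_1 = \mathrm{SS}(I_0)$, and after the T2T module the output tokens $T_f$ have a fixed length. The token sequences inside the T2T module are much longer than ViT's 196 tokens, so to reduce GPU memory usage and computation the authors keep the embedding dimension of the T2T module small, typically 32 or 64. ViT, by comparison, uses {Base: 768, Small: 384, Tiny: 192}.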

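T2T-ViT Backbone

The T2T-ViT backbone addresses the redundancy of many channels in ViT. To design a more efficient backbone and increase the richness of the feature maps, the authors borrow several CNN backbone designs. Each Transformer block has a residual connection, much like ResNet, so the authors try five ideas from CNN backbones:

From DenseNet: dense connections.

From Wide-ResNets: deep-narrow vs. shallow-wide structures.

From the SE module: channel attention.

From ResNeXt: more heads in the attention layer.

From GhostNet: the Ghost module.

From these comparisons the authors draw two conclusions:

A deep-narrow architecture with a reduced embedding dimension suits vision Transformers better and increases feature richness. Reducing the embedding dimension also lowers the computation.

The channel attention of the SE module also improves ViT, but less than the deep-narrow design.

Based on these conclusions, the authors design a deep-narrow T2T backbone with a small embedding dimension and more layers, as shown in Figure 5 below.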

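Figure 5: The T2T-ViT architecture

Figure 5 shows the T2T-ViT architecture. The T2T module first models the local information of the image and produces $T_f$, which then passes through the T2T-ViT backbone:

$$T_{f_0} = [t_{cls}; T_f] + E, \qquad T_{f_i} = \mathrm{MLP}(\mathrm{MSA}(T_{f_{i-1}})), \quad i = 1, \dots, b, \qquad y = \mathrm{fc}(\mathrm{LN}(T_{f_b}))$$

where $E$ is the sinusoidal position encoding and $y$ is obtained by passing the output at the class token through a fully connected layer.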

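Figure 5 shows that the T2T module has only 2 layers. Following the description above, this means 3 Soft Split (SS) operations and 2 Restructurization operations. The 3 Soft Splits correspond to 3 nn.Unfold operations with kernel sizes [7, 3, 3] and overlaps between patches of [3, 1, 1], that is, strides of [4, 2, 2]. The T2T module therefore turns a 224×224 image into a 14×14 grid of tokens.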
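The token grid after each Soft Split can be checked with a few lines of arithmetic. This is only a sketch: the padding values (2, 1, 1) and token_dim = 64 are taken from the `T2T_module` code shown later in this article, and the helper name `soft_split_size` is made up.

```python
def soft_split_size(size, kernel, stride, padding):
    """Output side length of one Soft Split (unfold) along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


side, channels, token_dim = 224, 3, 64
stages = []
for kernel, stride, padding in [(7, 4, 2), (3, 2, 1), (3, 2, 1)]:
    side = soft_split_size(side, kernel, stride, padding)
    # Each token is one flattened kernel x kernel patch over all channels.
    stages.append((side, side * side, channels * kernel * kernel))
    channels = token_dim  # after the first split, tokens are projected to token_dim

for side, num_tokens, token_len in stages:
    print(f"{side}x{side} grid, {num_tokens} tokens of length {token_len}")
# 56x56 grid, 3136 tokens of length 147
# 28x28 grid, 784 tokens of length 576
# 14x14 grid, 196 tokens of length 576
```

The intermediate sequences (3136 and 784 tokens) are far longer than the final 196, which is why the T2T module keeps its token dimension small.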

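The output of the T2T module then enters the T2T backbone, which has 14 blocks with an embedding dimension of 384. For comparison, ViT-B/16 has 12 blocks with an embedding dimension of 768. To compare models of similar size, the authors design five models: T2T-ViT-14, T2T-ViT-19, and T2T-ViT-24 have parameter counts comparable to ResNet50, ResNet101, and ResNet152, and T2T-ViT-7 and T2T-ViT-12 are comparable to MobileNetV1 and MobileNetV2.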

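Experiments

Experiment 1: Comparison with ViT, ResNet, and MobileNet

Figure 6 below compares ViT and T2T-ViT models trained directly on ImageNet. T2T-ViT is smaller than the standard ViT and performs better. Compared with ViT-S/16 at 48.6M parameters, T2T-ViT-14 has only 44.2% of the parameters and 51.5% of the computation, yet achieves better accuracy.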
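The 44.2% figure follows from the parameter counts quoted in this article (48.6M for ViT-S/16 here and 21.5M for T2T-ViT-14 in the ResNet comparison). A small check, with variable names chosen only for illustration:

```python
vit_s16_params_m = 48.6      # ViT-S/16 parameters, in millions (from the text)
t2t_vit_14_params_m = 21.5   # T2T-ViT-14 parameters, in millions (from the text)

ratio = t2t_vit_14_params_m / vit_s16_params_m
print(f"{ratio:.1%}")  # 44.2%
```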

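Figure 6: Comparison with ViT

Figure 7 below compares T2T-ViT and ResNet models trained directly on ImageNet. At similar model sizes, T2T-ViT outperforms ResNet. Compared with ResNet-50 at 25.5M parameters, T2T-ViT-14 has only 21.5M parameters and 5.2G of computation, yet achieves better accuracy.

Figure 7: Comparison with ResNet

Figure 8 below compares T2T-ViT and MobileNet models trained directly on ImageNet. At similar model sizes, T2T-ViT outperforms MobileNet. Compared with MobileNet V2 1.4× at 6.9M parameters, T2T-ViT-12 also has only 6.9M parameters but reaches 76.5% accuracy.

Figure 8: Comparison with MobileNet

Experiment 2: T2T with different efficient backbones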

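The authors compare five T2T backbones:

From DenseNet: dense connections.

From Wide-ResNets: deep-narrow vs. shallow-wide structures.

From the SE module: channel attention.

From ResNeXt: more heads in the attention layer.

From GhostNet: the Ghost module.

The results are shown in Figure 9 below, with a different color for each efficient backbone. Both the SE module (ViT-SE) and the deep-narrow structure (ViT-DN) improve ViT, but the deep-narrow structure is the most effective: it nearly halves model size and MACs and improves accuracy by 0.9% over the baseline ViT-S/16. The authors then apply these CNN structures to T2T-ViT. Their observations and conclusions are: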

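1) Deep-narrow is better for T2T-ViT than shallow-wide

ViT-DN has an embedding dimension of 384 and 16 layers; ViT-SW has an embedding dimension of 1024 and 4 layers. Compared with the baseline, ViT-SW drops by 8.2% while ViT-DN improves by 0.9%. These results support the authors' hypothesis that a shallow-wide ViT is redundant along the channel dimension and has limited feature richness in its shallow layers.

2) Dense connections hurt both ViT and T2T-ViT

3) The SE module improves both ViT and T2T-ViT

This means that applying attention over channels benefits both CNN and ViT models.

4) The ResNeXt structure has little effect on ViT and T2T-ViT

ResNeXt adds multiple heads on top of ResNet, and Transformers already use multi-head attention. With more heads, for example 32, the effect on accuracy is small, while a large number of heads increases GPU memory usage, so it is unnecessary in ViT and T2T-ViT.

5) The Ghost module can compress the model further

Compressing the model with the Ghost module costs some accuracy. The loss is more severe for ResNet and smaller for ViT models.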

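Figure 9: T2T with different efficient backbones

31.2 T2T-ViT code walkthrough:

The code comes from:

yitu-opensource/T2T-ViT

https://github.com/yitu-opensource/T2T-ViT

The code is built on the timm library. PyTorch Image Models, timm for short, is a large collection of PyTorch code and a widely used PyTorch framework for vision models. See:

科技猛兽: Excellent open-source vision Transformer work: a walkthrough of the vision transformer code in timm

https://zhuanlan.zhihu.com/p/350837279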

First, the usage:

1 Prepare the ImageNet dataset in a folder on the server, laid out as follows:

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

2 The README provides download links for many pretrained models. After downloading, use them as follows:

Suppose we want to use the t2t_vit_14 weights.

from models.t2t_vit import *
from utils import load_for_transfer_learning

# create model
model = t2t_vit_14()

# load the pretrained weights
# change num_classes based on the dataset; different image sizes also work
# because the position embedding is interpolated to the new size
load_for_transfer_learning(model, '/path/to/pretrained/weights', use_ema=True, strict=False, num_classes=100)

3 Evaluate a model:

Evaluate T2T-ViT-7 at image size 224×224.

CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model t2t_vit_7 -b 100 --eval_checkpoint path/to/checkpoint

Evaluate T2T-ViT-14 at image size 384×384.

CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model t2t_vit_14 --img-size 384 -b 100 --eval_checkpoint path/to/T2T-ViT-14-384

4 Train a model:

Train T2T-ViT-7 on 8 GPUs.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model t2t_vit_7 -b 64 --lr 1e-3 --weight-decay .03 --amp --img-size 224

5 Transfer learning on a small dataset:

Transfer a pretrained T2T-ViT-19 model to CIFAR10.

CUDA_VISIBLE_DEVICES=0,1 python transfer_learning.py --lr 0.05 --b 64 --num-classes 10 --img-size 224 --transfer-learning True --transfer-model /path/to/pretrained/T2T-ViT-19

Code walkthrough:

1 The Token Transformer block and Token Performer block of the T2T module:

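The files token_transformer.py and token_performer.py each hold one Transformer block or one Performer block, implemented as class Token_transformer and class Token_performer respectively.

This code corresponds to the T2T Transformer in Figure 4 and is one stage of the T2T module.

2 The remaining operations of the T2T module:

The T2T Transformer inside the T2T module can be either a ViT block or a Performer block. Assuming a ViT block, the code defines 3 self.soft_split operations and 2 self.attention operations.

In forward(), the input passes through self.soft_split0, self.attention1, self.soft_split1, self.attention2, and self.soft_split2 in turn, with the shape changing along the way. The exact shapes are annotated in the code.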

import numpy as np
import torch.nn as nn
# Token_transformer and Token_performer are defined in token_transformer.py and token_performer.py

class T2T_module(nn.Module):
    """
    Tokens-to-Token encoding module
    """
    def __init__(self, img_size=224, tokens_type='performer', in_chans=3, embed_dim=768, token_dim=64):
        super().__init__()

        if tokens_type == 'transformer':
            print('adopt transformer encoder for tokens-to-token')
            self.soft_split0 = nn.Unfold(kernel_size=(7, 7), stride=(4, 4), padding=(2, 2))
            self.soft_split1 = nn.Unfold(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
            self.soft_split2 = nn.Unfold(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))

            self.attention1 = Token_transformer(dim=in_chans * 7 * 7, in_dim=token_dim, num_heads=1, mlp_ratio=1.0)
            self.attention2 = Token_transformer(dim=token_dim * 3 * 3, in_dim=token_dim, num_heads=1, mlp_ratio=1.0)
            self.project = nn.Linear(token_dim * 3 * 3, embed_dim)

        elif tokens_type == 'performer':
            print('adopt performer encoder for tokens-to-token')
            self.soft_split0 = nn.Unfold(kernel_size=(7, 7), stride=(4, 4), padding=(2, 2))
            self.soft_split1 = nn.Unfold(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
            self.soft_split2 = nn.Unfold(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))

            #self.attention1 = Token_performer(dim=token_dim, in_dim=in_chans*7*7, kernel_ratio=0.5)
            #self.attention2 = Token_performer(dim=token_dim, in_dim=token_dim*3*3, kernel_ratio=0.5)
            self.attention1 = Token_performer(dim=in_chans*7*7, in_dim=token_dim, kernel_ratio=0.5)
            self.attention2 = Token_performer(dim=token_dim*3*3, in_dim=token_dim, kernel_ratio=0.5)
            self.project = nn.Linear(token_dim * 3 * 3, embed_dim)

        elif tokens_type == 'convolution':  # just for comparison with convolution, not our model
            # for this tokens type, you need to change forward to three convolution operations
            print('adopt convolution layers for tokens-to-token')
            self.soft_split0 = nn.Conv2d(3, token_dim, kernel_size=(7, 7), stride=(4, 4), padding=(2, 2))  # the 1st convolution
            self.soft_split1 = nn.Conv2d(token_dim, token_dim, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))  # the 2nd convolution
            self.project = nn.Conv2d(token_dim, embed_dim, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))  # the 3rd convolution

        # there are 3 soft splits, with strides 4, 2, 2 respectively
        self.num_patches = (img_size // (4 * 2 * 2)) * (img_size // (4 * 2 * 2))

    def forward(self, x):
        # step0: soft split
        x = self.soft_split0(x).transpose(1, 2)  # x: (B, 56*56, c*49)

        # iteration1: re-structurization/reconstruction
        x = self.attention1(x)  # x: (B, 56*56, 64)
        B, new_HW, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, int(np.sqrt(new_HW)), int(np.sqrt(new_HW)))  # x: (B, 64, 56, 56)
        # iteration1: soft split
        x = self.soft_split1(x).transpose(1, 2)  # x: (B, 28*28, 64*9)

        # iteration2: re-structurization/reconstruction
        x = self.attention2(x)  # x: (B, 28*28, 64)
        B, new_HW, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, int(np.sqrt(new_HW)), int(np.sqrt(new_HW)))  # x: (B, 64, 28, 28)
        # iteration2: soft split
        x = self.soft_split2(x).transpose(1, 2)  # x: (B, 14*14, 64*9)

        # final tokens
        x = self.project(x)  # x: (B, 14*14, embed_dim)

        return x

3 The full T2T model:

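This corresponds to the architecture in Figure 5 and clearly has two parts: the input first passes through the T2T module, self.tokens_to_token, and then through the Transformer backbone, self.blocks. The code style matches the ViT model in timm.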

class T2T_ViT(nn.Module):
    def __init__(self, img_size=224, tokens_type='performer', in_chans=3, num_classes=1000, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
                 drop_path_rate=0., norm_layer=nn.LayerNorm, token_dim=64):
        super().__init__()
        self.num_classes = num_classes
        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models

        self.tokens_to_token = T2T_module(
            img_size=img_size, tokens_type=tokens_type, in_chans=in_chans, embed_dim=embed_dim, token_dim=token_dim)
        num_patches = self.tokens_to_token.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # fixed sinusoidal position encoding (not trained)
        self.pos_embed = nn.Parameter(data=get_sinusoid_encoding(n_position=num_patches + 1, d_hid=embed_dim),
                                      requires_grad=False)
        self.pos_drop = nn.Dropout(p=drop_rate)

        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule
        self.blocks = nn.ModuleList([
            Block(
                dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
                drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer)
            for i in range(depth)])
        self.norm = norm_layer(embed_dim)

        # Classifier head
        self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()

        trunc_normal_(self.cls_token, std=.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward_features(self, x):
        B = x.shape[0]
        x = self.tokens_to_token(x)

        # prepend the class token, then add the position encoding
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x = x + self.pos_embed
        x = self.pos_drop(x)

        for blk in self.blocks:
            x = blk(x)

        x = self.norm(x)
        return x[:, 0]

    def forward(self, x):
        x = self.forward_features(x)
        x = self.head(x)
        return x

4 T2T backbone with the SE module:

The difference from the baseline is one line added near the end of the Attention operation:

x = self.se_layer(x)

class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):  # x: [B, N, C]
        x = torch.transpose(x, 1, 2)  # [B, C, N]
        b, c, _ = x.size()
        y = self.avg_pool(x).view(b, c)  # squeeze: average over tokens
        y = self.fc(y).view(b, c, 1)     # excitation: per-channel weights
        x = x * y.expand_as(x)
        x = torch.transpose(x, 1, 2)  # [B, N, C]
        return x


class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads

        self.scale = qk_scale or head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.se_layer = SELayer(dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.se_layer(x)  # the only change from the baseline Attention
        x = self.proj_drop(x)
        return x


class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
                 drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

    def forward(self, x):
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x

5 T2T backbone with the Ghost module:

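The difference from the baseline is that the FC1 layer of the MLP uses the Ghost operation, and in the Attention operation half of the Q, K, V features are produced by the Ghost operation. For a detailed explanation of the Ghost operation, see:

科技猛兽: Model compression explained, part 5: the Ghost module for reducing redundant features (Huawei Ghost network series)

https://zhuanlan.zhihu.com/p/346911265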

class Mlp_ghost(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, in_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

        self.ratio = hidden_features // in_features
        # cheap depthwise 1x1 convolutions that generate the "ghost" features
        self.cheap_operation2 = nn.Conv1d(in_features, in_features, kernel_size=1, groups=in_features, bias=False)
        self.cheap_operation3 = nn.Conv1d(in_features, in_features, kernel_size=1, groups=in_features, bias=False)

    def forward(self, x):  # x: [B, N, C]
        x1 = self.fc1(x)  # x1: [B, N, C]
        x1 = self.act(x1)

        x2 = self.cheap_operation2(x1.transpose(1, 2))  # x2: [B, C, N]
        x2 = x2.transpose(1, 2)  # x2: [B, N, C]
        x2 = self.act(x2)

        x3 = self.cheap_operation3(x1.transpose(1, 2))  # x3: [B, C, N]
        x3 = x3.transpose(1, 2)  # x3: [B, N, C]
        x3 = self.act(x3)

        x = torch.cat((x1, x2, x3), dim=2)  # x: [B, N, 3C]
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


class Attention_ghost(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads

        self.scale = qk_scale or head_dim ** -0.5
        half_dim = int(0.5 * dim)
        # only half of the Q, K, V features come from full linear layers
        self.q = nn.Linear(dim, half_dim, bias=qkv_bias)
        self.k = nn.Linear(dim, half_dim, bias=qkv_bias)
        self.v = nn.Linear(dim, half_dim, bias=qkv_bias)
        # the other half is generated by cheap depthwise 1x1 convolutions
        self.cheap_operation_q = nn.Conv1d(half_dim, half_dim, kernel_size=1, groups=half_dim, bias=False)
        self.cheap_operation_k = nn.Conv1d(half_dim, half_dim, kernel_size=1, groups=half_dim, bias=False)
        self.cheap_operation_v = nn.Conv1d(half_dim, half_dim, kernel_size=1, groups=half_dim, bias=False)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        B, N, C = x.shape
        q = self.q(x)
        k = self.k(x)
        v = self.v(x)

        q1 = self.cheap_operation_q(q.transpose(1, 2)).transpose(1, 2)
        k1 = self.cheap_operation_k(k.transpose(1, 2)).transpose(1, 2)
        v1 = self.cheap_operation_v(v.transpose(1, 2)).transpose(1, 2)

        q = torch.cat((q, q1), dim=2).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
        k = torch.cat((k, k1), dim=2).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)
        v = torch.cat((v, v1), dim=2).reshape(B, N, self.num_heads, C // self.num_heads).permute(0, 2, 1, 3)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x

Summary

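T2T-ViT uses the Tokens-to-Token module to model the local structure of an image, and a more efficient Transformer backbone design to enrich intermediate features and reduce redundancy. As a result, a vision Transformer trained on ImageNet alone outperforms the ResNet family of CNNs. Its design ideas have had a positive influence on later vision Transformer work.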

32 VOLO: new records on several CV tasks, reaching 87.1% on ImageNet for the first time without extra training data

Paper title: VOLO: Vision Outlooker for Visual Recognition

Paper URL:

https://arxiv.org/abs/2106.13112

32.1 How VOLO works:

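VOLO and T2T-ViT come from the same authors. In the VOLO paper, the authors argue that the main factor limiting ViTs on ImageNet classification is their weak ability to encode fine-level features and contexts into tokens.

As mentioned earlier, ViT splits an image into patches and converts them into tokens, using a patch size of 16×16. We could of course reduce the patch size to get a finer tokenization, but the sequence then becomes too long. If the patch size changes from 16×16 to 4×4, the sequence becomes 16 times longer. Since the cost of ViT's self-attention grows with the square of the sequence length, that cost becomes 256 times larger, which is unacceptable.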
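The 16× and 256× factors can be reproduced with a short calculation. This sketch assumes a 224×224 input and counts only the quadratic self-attention term; the helper name `token_count` is made up.

```python
def token_count(image_size, patch_size):
    """Number of non-overlapping patches for a square image."""
    return (image_size // patch_size) ** 2


n16 = token_count(224, 16)   # tokens with 16x16 patches
n4 = token_count(224, 4)     # tokens with 4x4 patches

length_ratio = n4 // n16                  # how much longer the sequence gets
attention_cost_ratio = length_ratio ** 2  # attention cost scales with length squared

print(n16, n4, length_ratio, attention_cost_ratio)  # 196 3136 16 256
```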

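To solve this problem, the authors introduce a new Outlooker attention and propose a simple and general architecture called Vision Outlooker (VOLO). Outlooker attention efficiently encodes fine-level features and context into the token representations, which are critical for recognition performance.

Like T2T-ViT, VOLO has two stages. Stage 1 is a stack of Outlookers that generate fine-level tokens. Stage 2 uses Transformer blocks to model global information. At the start of each stage, a patch embedding module maps the input to tokens of the desired shape.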

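Outlooker module

The Outlooker module is designed to replace the self-attention module, so given an input $X \in \mathbb{R}^{H \times W \times C}$, an Outlooker can be written as:

$$\tilde{X} = \mathrm{OutlookAtt}(\mathrm{LN}(X)) + X, \qquad Z = \mathrm{MLP}(\mathrm{LN}(\tilde{X})) + \tilde{X}$$

Before going into the Outlooker module, readers should first understand nn.Fold, which is the inverse of nn.Unfold. Readers unfamiliar with nn.Fold can refer to:

Fold - PyTorch 1.9.0 documentation

https://pytorch.org/docs/stable/generated/torch.nn.Fold.html

Or here is an informal explanation of nn.Fold. Suppose we have a tensor of shape $(N, C \times k \times k, L)$. nn.Fold applies a $k \times k$ kernel to this tensor and performs the inverse of the sliding-window extraction used in convolution. The output has shape $(N, C, H, W)$.

Here

$$L = \left\lfloor \frac{H + 2p - k}{s} + 1 \right\rfloor \times \left\lfloor \frac{W + 2p - k}{s} + 1 \right\rfloor$$

where $s$ and $p$ are the stride and padding arguments of nn.Fold.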

Example:

>>> fold = nn.Fold(output_size=(4, 5), kernel_size=(2, 2))
>>> input = torch.randn(1, 3 * 2 * 2, 12)
>>> output = fold(input)
>>> output.size()
torch.Size([1, 3, 4, 5])
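The last dimension of 12 in that example is not arbitrary: it is the number of 2×2 sliding blocks that fit in a 4×5 output. A minimal sketch of that count, assuming dilation 1 and using the made-up helper name `fold_num_blocks`:

```python
def fold_num_blocks(output_size, kernel_size, stride=1, padding=0):
    """Number of sliding blocks L that nn.Fold expects in its input (dilation = 1)."""
    blocks = 1
    for size, k in zip(output_size, kernel_size):
        blocks *= (size + 2 * padding - k) // stride + 1
    return blocks


# (4 - 2 + 1) * (5 - 2 + 1) = 3 * 4 = 12 blocks
print(fold_num_blocks((4, 5), (2, 2)))  # 12
```

If the input's last dimension does not match this count, the fold cannot tile the requested output size.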

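Figure 11 below shows the structure of Outlook attention.

An input image is split into patches by the patch embedding. Suppose the result has shape $H \times W \times C$ (in practice $28 \times 28 \times C$ for a 224×224 input with patch size 8). For any spatial position $(i, j)$, computing the attention matrix between its token and all the other $H \times W$ tokens is impractical because the computation would explode. So the authors only compute attention over the $K \times K$ tokens centered at that position. This "$K \times K$ window centered at $(i, j)$" can be written as: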

$$\mathbf{V}_{\Delta_{i,j}}=\{\mathbf{V}_{i+p-\lfloor \frac{K}{2} \rfloor,\,j+q-\lfloor \frac{K}{2} \rfloor}\}, \quad 0 \leq p,q < K$$

        ...
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer)

    def forward(self, x):
        x = x + self.drop_path(self.attn(self.norm1(x)))
        x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x

3 Creating an Outlooker stage or a Transformer stage:

def outlooker_blocks(block_fn, index, dim, layers, num_heads=1, kernel_size=3,
                     padding=1, stride=1, mlp_ratio=3., qkv_bias=False, qk_scale=None,
                     attn_drop=0, drop_path_rate=0., **kwargs):
    """
    generate outlooker layer in stage1
    return: outlooker layers
    """
    blocks = []
    for block_idx in range(layers[index]):
        # stochastic depth rate grows linearly with the global block index
        block_dpr = drop_path_rate * (block_idx + sum(layers[:index])) / (sum(layers) - 1)
        blocks.append(block_fn(dim, kernel_size=kernel_size, padding=padding,
                               stride=stride, num_heads=num_heads, mlp_ratio=mlp_ratio,
                               qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop,
                               drop_path=block_dpr))

    blocks = nn.Sequential(*blocks)
    return blocks


def transformer_blocks(block_fn, index, dim, layers, num_heads, mlp_ratio=3.,
                       qkv_bias=False, qk_scale=None, attn_drop=0,
                       drop_path_rate=0., **kwargs):
    """
    generate transformer layers in stage2
    return: transformer layers
    """
    blocks = []
    for block_idx in range(layers[index]):
        block_dpr = drop_path_rate * (block_idx + sum(layers[:index])) / (sum(layers) - 1)
        blocks.append(
            block_fn(dim, num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias,
                     qk_scale=qk_scale, attn_drop=attn_drop, drop_path=block_dpr))

    blocks = nn.Sequential(*blocks)
    return blocks

4 Patch Embedding code:

This replaces ViT's patch-splitting step; in essence it is a stack of convolutions.

class PatchEmbed(nn.Module):
    """
    Image to Patch Embedding.
    Different with ViT use 1 conv layer, we use 4 conv layers to do patch embedding
    """
    def __init__(self, img_size=224, stem_conv=False, stem_stride=1,
                 patch_size=8, in_chans=3, hidden_dim=64, embed_dim=384):
        super().__init__()
        assert patch_size in [4, 8, 16]

        self.stem_conv = stem_conv
        if stem_conv:
            self.conv = nn.Sequential(
                nn.Conv2d(in_chans, hidden_dim, kernel_size=7, stride=stem_stride,
                          padding=3, bias=False),  # 112x112
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=1,
                          padding=1, bias=False),  # 112x112
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=1,
                          padding=1, bias=False),  # 112x112
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
            )

        self.proj = nn.Conv2d(hidden_dim, embed_dim,
                              kernel_size=patch_size // stem_stride,
                              stride=patch_size // stem_stride)
        self.num_patches = (img_size // patch_size) * (img_size // patch_size)

    def forward(self, x):
        if self.stem_conv:
            x = self.conv(x)
        x = self.proj(x)  # B, C, H, W
        return x

5 The overall VOLO model

Parameter definitions:

layers: [x,x,x,x], the 4 blocks of the 2 stages. The first block is Outlooker and the other 3 are Transformer.

patch_size: patch size of the patch embedding for the stage-1 outlook attention.

stem_hidden_dim: hidden dimension of the patch embedding. It is 64 for VOLO-D1 to D4 and 128 for D5.

embed_dims, num_heads: embedding dimension and number of heads.

outlook_attention: whether to use outlook attention.

return_mean: whether to classify using the mean of all tokens.

out_kernel, out_stride, out_padding: kernel size, stride, and padding of the outlook attention.
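To see how patch_size, the stem stride, and pooling_scale determine the token grids, here is a small arithmetic sketch. It assumes the defaults visible in the code (img_size=224, patch_size=8, stem_stride=2, pooling_scale=2); the function name `volo_token_grid` is made up for illustration.

```python
def volo_token_grid(img_size=224, patch_size=8, stem_stride=2, pooling_scale=2):
    """Side lengths after the conv stem, in stage 1, and in stage 2."""
    # 7x7 stem convolution with padding 3 and stride stem_stride
    after_stem = (img_size + 2 * 3 - 7) // stem_stride + 1
    # final projection uses kernel = stride = patch_size // stem_stride
    proj_kernel = patch_size // stem_stride
    stage1 = after_stem // proj_kernel
    # side of pos_embed, added after the Outlooker blocks and the downsampling
    stage2 = img_size // patch_size // pooling_scale
    return after_stem, stage1, stage2


print(volo_token_grid())  # (112, 28, 14)
```

Under these defaults the Outlookers work on a 28×28 grid, while the Transformer blocks work on the cheaper 14×14 grid.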

class VOLO(nn.Module):
    def __init__(self, layers, img_size=224, in_chans=3, num_classes=1000, patch_size=8,
                 stem_hidden_dim=64, embed_dims=None, num_heads=None, downsamples=None,
                 outlook_attention=None, mlp_ratios=None, qkv_bias=False, qk_scale=None,
                 drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=nn.LayerNorm,
                 post_layers=None, return_mean=False, return_dense=True, mix_token=True,
                 pooling_scale=2, out_kernel=3, out_stride=2, out_padding=1):
        super().__init__()
        self.num_classes = num_classes
        self.patch_embed = PatchEmbed(stem_conv=True, stem_stride=2, patch_size=patch_size,
                                      in_chans=in_chans, hidden_dim=stem_hidden_dim,
                                      embed_dim=embed_dims[0])

        # initial positional encoding, we add positional encoding after outlooker blocks
        self.pos_embed = nn.Parameter(
            torch.zeros(1, img_size // patch_size // pooling_scale,
                        img_size // patch_size // pooling_scale,
                        embed_dims[-1]))
        self.pos_drop = nn.Dropout(p=drop_rate)

        # set the main block in network
        network = []
        for i in range(len(layers)):
            if outlook_attention[i]:
                # stage 1
                stage = outlooker_blocks(Outlooker, i, embed_dims[i], layers,
                                         downsample=downsamples[i], num_heads=num_heads[i],
                                         kernel_size=out_kernel, stride=out_stride,
                                         padding=out_padding, mlp_ratio=mlp_ratios[i],
                                         qkv_bias=qkv_bias, qk_scale=qk_scale,
                                         attn_drop=attn_drop_rate, norm_layer=norm_layer)
                network.append(stage)
            else:
                # stage 2
                stage = transformer_blocks(Transformer, i, embed_dims[i], layers,
                                           num_heads[i], mlp_ratio=mlp_ratios[i],
                                           qkv_bias=qkv_bias, qk_scale=qk_scale,
                                           drop_path_rate=drop_path_rate,
                                           attn_drop=attn_drop_rate,
                                           norm_layer=norm_layer)
                network.append(stage)

            if downsamples[i]:
                # downsampling between two stages
                network.append(Downsample(embed_dims[i], embed_dims[i + 1], 2))

        self.network = nn.ModuleList(network)

If a post_network is attached after the network above, for example the class attention from CaiT, then:

        # set post block, for example, class attention layers
        self.post_network = None
        if post_layers is not None:
            self.post_network = nn.ModuleList([
                get_block(post_layers[i], dim=embed_dims[-1], num_heads=num_heads[-1],
                          mlp_ratio=mlp_ratios[-1], qkv_bias=qkv_bias, qk_scale=qk_scale,
                          attn_drop=attn_drop_rate, drop_path=0., norm_layer=norm_layer)
                for i in range(len(post_layers))
            ])
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dims[-1]))
            trunc_normal_(self.cls_token, std=.02)

        # set output type
        self.return_mean = return_mean  # if yes, return mean, not use class token
        self.return_dense = return_dense  # if yes, return class token and all feature tokens
        if return_dense:
            assert not return_mean, "cannot return both mean and dense"
        self.mix_token = mix_token
        self.pooling_scale = pooling_scale
        if mix_token:  # enable token mixing, see token labeling for details.
            self.beta = 1.0
            assert return_dense, "return all tokens if mix_token is enabled"
        if return_dense:
            self.aux_head = nn.Linear(
                embed_dims[-1], num_classes) if num_classes > 0 else nn.Identity()
        self.norm = norm_layer(embed_dims[-1])

        # Classifier head
        self.head = nn.Linear(
            embed_dims[-1], num_classes) if num_classes > 0 else nn.Identity()

        trunc_normal_(self.pos_embed, std=.02)
        self.apply(self._init_weights)

Forward functions:

    def forward_embeddings(self, x):
        # patch embedding
        x = self.patch_embed(x)
        # B,C,H,W -> B,H,W,C
        x = x.permute(0, 2, 3, 1)
        return x

    def forward_tokens(self, x):
        for idx, block in enumerate(self.network):
            if idx == 2:  # add positional encoding after outlooker blocks
                x = x + self.pos_embed
                x = self.pos_drop(x)
            x = block(x)

        B, H, W, C = x.shape
        x = x.reshape(B, -1, C)
        return x

    def forward_cls(self, x):
        B, N, C = x.shape
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        for block in self.post_network:
            x = block(x)
        return x

    def forward(self, x):
        # step1: patch embedding
        x = self.forward_embeddings(x)

        # mix token, see token labeling for details.
        if self.mix_token and self.training:
            lam = np.random.beta(self.beta, self.beta)
            patch_h, patch_w = x.shape[1] // self.pooling_scale, x.shape[2] // self.pooling_scale
            bbx1, bby1, bbx2, bby2 = rand_bbox(x.size(), lam, scale=self.pooling_scale)
            temp_x = x.clone()
            sbbx1, sbby1, sbbx2, sbby2 = self.pooling_scale * bbx1, self.pooling_scale * bby1, \
                                         self.pooling_scale * bbx2, self.pooling_scale * bby2
            temp_x[:, sbbx1:sbbx2, sbby1:sbby2, :] = x.flip(0)[:, sbbx1:sbbx2, sbby1:sbby2, :]
            x = temp_x
        else:
            bbx1, bby1, bbx2, bby2 = 0, 0, 0, 0

        # step2: tokens learning in the two stages
        x = self.forward_tokens(x)

        # step3: post network, apply class attention or not
        if self.post_network is not None:
            x = self.forward_cls(x)
        x = self.norm(x)

        if self.return_mean:  # if no class token, return mean
            return self.head(x.mean(1))

        x_cls = self.head(x[:, 0])
        if not self.return_dense:
            return x_cls

        x_aux = self.aux_head(x[:, 1:])  # generate classes in all feature tokens, see token labeling

        if not self.training:
            return x_cls + 0.5 * x_aux.max(1)[0]

        if self.mix_token and self.training:  # reverse "mix token", see token labeling for details.
            x_aux = x_aux.reshape(x_aux.shape[0], patch_h, patch_w, x_aux.shape[-1])
            temp_x = x_aux.clone()
            temp_x[:, bbx1:bbx2, bby1:bby2, :] = x_aux.flip(0)[:, bbx1:bbx2, bby1:bby2, :]
            x_aux = temp_x
            x_aux = x_aux.reshape(x_aux.shape[0], patch_h * patch_w, x_aux.shape[-1])

        # return these: 1. class token, 2. classes from all feature tokens, 3. bounding box
        return x_cls, x_aux, (bbx1, bby1, bbx2, bby2)

VOLO-D1 configuration:

There are 4 blocks: the first is Outlooker and the other 3 are Transformer. The embedding dimensions are 192, 384, 384, 384 and num_heads are 6, 12, 12, 12. Downsampling is applied only after the first block.
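The helper functions outlooker_blocks and transformer_blocks compute a per-block stochastic depth rate from the layers list. The sketch below reproduces only that formula for the VOLO-D1 layout layers = [4, 4, 8, 2]. The value drop_path_rate = 0.1 is a hypothetical choice for illustration, and the function name `drop_path_schedule` is made up.

```python
def drop_path_schedule(layers, drop_path_rate):
    """Per-block drop-path rates, using the formula from the block helpers."""
    total = sum(layers)
    rates = []
    for index, depth in enumerate(layers):
        offset = sum(layers[:index])  # blocks that come before this stage
        rates.append([drop_path_rate * (block_idx + offset) / (total - 1)
                      for block_idx in range(depth)])
    return rates


rates = drop_path_schedule([4, 4, 8, 2], drop_path_rate=0.1)  # 0.1 is an assumed value
flat = [r for stage in rates for r in stage]
print(len(flat))                         # 18 blocks in total
print([round(r, 4) for r in rates[0]])   # rates for the 4 blocks of the first stage
```

The rate grows linearly with the global block index, from 0 at the first block to drop_path_rate at the last one.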

@register_model
def volo_d1(pretrained=False, **kwargs):
    """
    VOLO-D1 model, Params: 27M
    --layers: [x,x,x,x], four blocks in two stages, the first stage(block) is outlooker,
            the other three blocks are transformer, we set four blocks, which are easily
            applied to downstream tasks
    --embed_dims, --num_heads,: embedding dim, number of heads in each block
    --downsamples: flags to apply downsampling or not in four blocks
    --outlook_attention: flags to apply outlook attention or not
    --mlp_ratios: mlp ratio in four blocks
    --post_layers: post layers like two class attention layers using [ca, ca]
    See detail for all args in the class VOLO()
    """
    layers = [4, 4, 8, 2]  # num of layers in the four blocks
    embed_dims = [192, 384, 384, 384]
    num_heads = [6, 12, 12, 12]
    mlp_ratios = [3, 3, 3, 3]
    downsamples = [True, False, False, False]  # do downsampling after first block
    # first block is outlooker (stage1), the other three are transformer (stage2)
    outlook_attention = [True, False, False, False]
    model = VOLO(layers,
                 embed_dims=embed_dims,
                 num_heads=num_heads,
                 mlp_ratios=mlp_ratios,
                 downsamples=downsamples,
                 outlook_attention=outlook_attention,
                 post_layers=['ca', 'ca'],
                 **kwargs)
    model.default_cfg = default_cfgs['volo']
    return model

Summary

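VOLO introduces a new Outlooker attention and proposes a simple and general architecture called Vision Outlooker (VOLO). Outlooker attention efficiently encodes fine-level features and context into the token representations. Its main operations are Unfold and Fold, which combine the information of the $K \times K$ tokens around each token.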

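About the author

科技猛兽 (signed author at 极市平台)

Zhihu: 科技猛兽

Master's student (class of 2019) in the Department of Automation, Tsinghua University; currently an intern at Huawei Noah's Ark Lab in Beijing.

Research area: AI edge computing (Efficient AI with Tiny Resource), focusing on model compression, architecture search, quantization, acceleration, adder networks, and their combination with other tasks to better serve on-device deployment.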
