
Multi-label Classification of Chest X-ray Images with a Pre-trained Vision Transformer Model



Xing Suxia, Ju Zihan, Liu Zijiao, Wang Yu, Fan Fuqiang (Beijing Technology and Business University)

Abstract: Objective Computer-based detection and classification of diseases in chest X-ray images currently suffers from high misdiagnosis rates and low accuracy. Building on a pre-trained Vision Transformer (ViT) model, this work applies transfer learning to realize computer-aided diagnosis of chest X-ray images and to improve diagnostic accuracy and efficiency. Method A ViT model that incorporates a convolutional neural network (CNN), pre-trained on a very large-scale natural image dataset, is adopted. The model structure is fine-tuned, the backbone network is initialized with the pre-trained ViT parameters, and the model is then transferred to chest X-ray datasets for further training to perform multi-label disease classification. Result The mean AUC (area under the ROC curve) of the ViT model before and after transfer learning is compared on the IU X-Ray dataset. The pre-trained ViT model reaches a mean AUC of 0.774, an improvement of 0.208 over the model trained without transfer learning. Ablation experiments on the model structure and data preprocessing are conducted, and the attention mechanism in ViT is visualized, further verifying the model's effectiveness. Finally, the fine-tuned ViT model trained on the ChestX-ray14 and CheXpert datasets achieves mean AUC scores of 0.839 and 0.806, improvements of 0.014 to 0.031 over other current methods. Conclusion Compared with previous methods, the ViT model attains higher multi-label classification accuracy on chest X-ray images, and transfer learning improves the ViT model's classification performance and generalization while reducing training cost. The ablation experiments and model visualization show that a ViT model containing a CNN structure focuses on meaningful regions and efficiently extracts visual features from chest X-ray images.


Extended Abstract: Objective Chest X-ray, an important screening and diagnostic tool in radiology, is widely used in clinical diagnosis. At present, the interpretation of chest X-ray images relies mostly on human observation; affected by workload and clinical experience, doctors are prone to misdiagnosis and missed diagnoses. Using computers to automatically detect and identify one or more potential diseases in images can effectively help doctors improve diagnostic efficiency and accuracy. Compared with natural images, abnormal areas in chest X-ray images occupy a small proportion of the image and have complex appearances, and a single image may contain multiple diseases, which makes lesions difficult to accurately detect and distinguish. With the development of artificial intelligence and intelligent medical information technology, deep learning models represented by the convolutional neural network (CNN) have been widely used in medical imaging. The CNN convolution kernel is particularly sensitive to local detail and can extract rich image features. However, the convolution kernel cannot capture global information, and the extracted features contain redundant information such as background, muscle, and bone, which limits the model's performance on multi-label classification tasks. Recently, the Vision Transformer (ViT) model has achieved excellent results in computer vision tasks. ViT can simultaneously capture information from different regions of the entire image and focus on its meaningful parts. However, ViT requires training on large-scale datasets to perform well, and, owing to patient privacy and manual annotation costs, chest X-ray datasets are limited in size. To reduce the model's dependence on data scale and improve multi-label classification performance, we propose to use a pre-trained ViT model with a CNN, combined with transfer learning, to achieve computer-aided diagnosis of chest X-ray images and multi-label disease classification.

Methods The ViT model with a CNN, pre-trained on a very large-scale natural image dataset, provides the initial model parameters. The model structure is fine-tuned according to the characteristics of the chest X-ray datasets. A 1×1 convolution layer converts the chest X-ray images from 1 channel to 3 channels. The number of output nodes of the linear layer in the classifier is changed from 1000 to the number of chest X-ray classification labels, with a Sigmoid activation function. The backbone network is initialized with the pre-trained ViT parameters and then trained on the chest X-ray dataset to perform multi-label classification. In the experiments, the model is built with Python 3.7 and PyTorch 1.8 and trained on an RTX 3090 GPU, using the stochastic gradient descent (SGD) optimizer, the binary cross-entropy (BCE) loss function, an initial learning rate of 1e-3, and cosine annealing learning rate decay. During training, each image is scaled to 512×512 and a 224×224 area is randomly cropped as model input; data augmentation randomly applies a subset of flipping, perspective transformation, shearing, translation, zooming, and brightness changes. During testing, each chest X-ray image is scaled to 256×256 and a 224×224 area is center-cropped as input to the trained model.
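The pipeline above can be summarized in code. The following is a minimal PyTorch sketch, not the authors' released implementation: the hybrid R50+ViT-B/16 backbone is assumed to come from timm (the exact model name, e.g. vit_base_r50_s16_224, varies across timm versions), the label count of 14 matches ChestX-ray14, and the SGD momentum, annealing period, and augmentation magnitudes are illustrative placeholders not given in the abstract.

import timm
import torch
import torch.nn as nn
from torchvision import transforms

NUM_LABELS = 14  # e.g. ChestX-ray14; set to the target dataset's label count

class ChestXrayViT(nn.Module):
    def __init__(self, num_labels=NUM_LABELS):
        super().__init__()
        # 1x1 convolution maps the single-channel X-ray to 3 channels,
        # matching the input expected by the pre-trained backbone.
        self.channel_adapter = nn.Conv2d(1, 3, kernel_size=1)
        # Hybrid CNN+ViT backbone initialized from pre-trained parameters;
        # timm replaces the original 1000-way head with a num_labels-way
        # linear layer when num_classes is passed.
        self.backbone = timm.create_model(
            "vit_base_r50_s16_224", pretrained=True, num_classes=num_labels)

    def forward(self, x):                 # x: (B, 1, 224, 224)
        x = self.channel_adapter(x)       # -> (B, 3, 224, 224)
        logits = self.backbone(x)         # -> (B, num_labels)
        return torch.sigmoid(logits)      # independent probability per label

model = ChestXrayViT()
# Optimizer, loss, and schedule as stated in the abstract; momentum and
# T_max are assumed values.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.BCELoss()                  # targets: (B, num_labels) in {0, 1}

# Train-time preprocessing: scale to 512x512, random 224x224 crop, and a
# random subset of the listed augmentations.
train_tf = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.3),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05),
                            scale=(0.9, 1.1), shear=5),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])
# Test-time preprocessing: scale to 256x256, center-crop 224x224.
test_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])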
Results We compare the model's performance before and after ViT transfer learning on IU X-Ray, a small-scale chest X-ray dataset. The model is quantitatively evaluated by the mean AUC (area under the ROC curve) over all classification labels. The pre-trained ViT model reaches a mean AUC of 0.774. Without pre-training, the ViT model's accuracy and training efficiency drop significantly, and its mean AUC is only 0.566, which is 0.208 lower than after transfer learning. In addition, an attention heatmap is generated from the ViT model, which increases the interpretability of the model. A series of ablation experiments on data augmentation, model structure, and batch size demonstrates the effectiveness of each step of the proposed algorithm. Finally, the fine-tuned ViT model trained on the ChestX-ray14 and CheXpert datasets achieves mean AUC scores of 0.839 and 0.806, improvements of 0.014 and 0.031 over other current methods.

Conclusions In this study, a pre-trained ViT model is applied to the multi-label classification of chest X-ray images through transfer learning. The experimental results show that, compared with current methods, ViT has stronger multi-label classification performance on chest X-ray images, and its attention mechanism helps the model focus on important areas such as the interior of the chest cavity and the heart. Transfer learning improves the classification performance and generalization of ViT on small-scale datasets and greatly reduces training cost. The ablation experiments show that the combined CNN-and-Transformer model is superior to any single-structure model, and that data augmentation and smaller batch sizes improve performance, although smaller batches lengthen training. In future work, we will further improve the model's ability to extract complex diseases and high-level semantic information such as small lesions, disease location, and severity.

Keywords chest X-ray images, multi-label classification, convolutional neural network, vision transformer, transfer learning
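As a companion to the reported scores, here is a minimal sketch of how the evaluation metric, the mean per-label AUC, can be computed with scikit-learn; y_true and y_prob are hypothetical stand-ins for the ground-truth labels and Sigmoid outputs collected over a test set.

import numpy as np
from sklearn.metrics import roc_auc_score

def mean_label_auc(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Mean AUC over labels; y_true, y_prob have shape (n_samples, n_labels)."""
    aucs = [
        roc_auc_score(y_true[:, k], y_prob[:, k])
        for k in range(y_true.shape[1])
        if len(np.unique(y_true[:, k])) > 1  # AUC undefined for one-class labels
    ]
    return float(np.mean(aucs))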

