
Image GPT


Unsupervised and self-supervised learning,[^reference-1] or learning without human-labeled data, is a longstanding challenge of machine learning. Recently, it has seen incredible success in language, as transformer[^reference-2] models like BERT,[^reference-3] GPT-2,[^reference-4] RoBERTa,[^reference-5] T5,[^reference-6] and other variants[^reference-7][^reference-8][^reference-9][^reference-10] have achieved top performance on a wide array of language tasks. However, the same broad class of models has not been successful in producing strong features for image classification.[^reference-11] Our work aims to understand and bridge this gap.

Transformer models like BERT and GPT-2 are domain agnostic, meaning that they can be directly applied to 1-D sequences of any form. When we train GPT-2 on images unrolled into long sequences of pixels, which we call iGPT, we find that the model appears to understand 2-D image characteristics such as object appearance and category. This is evidenced by the diverse range of coherent image samples it generates, even without the guidance of human-provided labels. As further proof, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy[^footnote-accuracy] on ImageNet.
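The "unrolling" step above can be sketched in a few lines. This is an illustrative simplification, not the paper's pipeline: iGPT first reduces the color space with a learned 9-bit palette, whereas this sketch simply averages the RGB channels to one value per pixel before flattening in raster order.

```python
import numpy as np

# Toy 32x32 RGB image with pixel values in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Stand-in for iGPT's color-space reduction (an assumption, not the
# paper's learned clustering): average RGB to one value per pixel.
grey = image.mean(axis=-1).astype(np.int64)  # shape (32, 32)

# Unroll in raster order (row by row) into a 1-D token sequence,
# the form a GPT-2-style transformer consumes.
sequence = grey.reshape(-1)  # shape (1024,)
print(sequence.shape)  # (1024,)
```

The resulting length-1024 sequence is what makes the 32x32 resolution practical: attention cost grows quadratically in sequence length, which is why iGPT works at low resolutions rather than full ImageNet size.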

| Evaluation | Dataset | Our Result | Best non-iGPT Result |
| --- | --- | --- | --- |
| Logistic regression on learned features (linear probe) | CIFAR-10 | 96.3 (iGPT-L 32x32 w/ 1536 features) | 95.3 (SimCLR[^reference-12] w/ 8192 features) |
| | CIFAR-100 | 82.8 (iGPT-L 32x32 w/ 1536 features) | 80.2 (SimCLR w/ 8192 features) |
| | STL-10 | 95.5 (iGPT-L 32x32 w/ 1536 features) | 94.2 (AMDIM[^reference-13] w/ 8192 features) |
| | ImageNet | 72.0 (iGPT-XL[a] 64x64 w/ 15360 features) | 76.5 (SimCLR w/ 8192 features) |
| Full fine-tune | CIFAR-10 | 99.0 (iGPT-L 32x32, trained on ImageNet) | 99.0[b] (GPipe,[^reference-15] trained on ImageNet) |
| | ImageNet 32x32 | 66.3 (iGPT-L 32x32) | 70.2 (Isometric Nets[^reference-16]) |

[a] We only show ImageNet linear probe accuracy for iGPT-XL since other experiments did not finish before we needed to transition to different supercomputing facilities.

[b] BiT-L, trained on JFT (300M images with 18K classes), achieved a result of 99.3.
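The "linear probe" rows above amount to fitting a logistic regression classifier on frozen features. A minimal sketch of that evaluation protocol, using random arrays as stand-ins for the extracted features and labels (iGPT-L yields 1536-dimensional features; the 16 dimensions here just keep the sketch fast):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for features extracted from a frozen pretrained model and
# for the dataset's class labels (hypothetical data, 10 classes).
train_feats = rng.normal(size=(200, 16))
train_labels = rng.integers(0, 10, size=200)

# The probe itself: multinomial logistic regression on the frozen
# features. Its accuracy is the measure of feature quality; the
# pretrained network's weights are never updated.
probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
accuracy = probe.score(train_feats, train_labels)
```

In contrast, the "full fine-tune" rows update all of the pretrained network's weights on the downstream task rather than training only this final linear layer.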

To highlight the potential of generative[^reference-17][^reference-18] sequence modeling[^reference-19][^reference-20][^reference-21][^reference-22] as a general purpose unsupervised learning algorithm, we deliberately use the same transformer architecture as GPT-2 in language. As a consequence, we require significantly more compute in order to produce features competitive with those from top unsupervised convolutional nets.[^reference-13][^reference-23][^reference-24][^reference-25][^reference-12] However, our results suggest that when faced with a new domain where the correct model priors are unknown, a large GPT-2 can learn excellent features without the need for domain-specific[^reference-26][^reference-27][^reference-28] architectural design choices.
