Vision Transformer (ViT)

Overview

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures.

The abstract from the paper is the following:

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

ViT architecture. Taken from the original paper.

Following the original Vision Transformer, some follow-up works have been made:

- DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision transformers. The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into [ViTModel] or [ViTForImageClassification]. There are 4 variants available (in 3 different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. Note that one should use [DeiTImageProcessor] in order to prepare images for the model (see the sketch after this list).
- BEiT (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
- DINO (a method for self-supervised training of Vision Transformers) by Facebook AI. Vision Transformers trained using the DINO method show very interesting properties not seen with convolutional models. They are capable of segmenting objects, without having ever been trained to do so. DINO checkpoints can be found on the hub.
- MAE (Masked Autoencoders) by Facebook AI. By pre-training Vision Transformers to reconstruct pixel values for a high portion (75%) of masked patches (using an asymmetric encoder-decoder architecture), the authors show that this simple method outperforms supervised pre-training after fine-tuning.

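As a quick illustration of the DeiT point above, here is a minimal sketch of loading one of those checkpoints into the plain ViT classes (inference then works exactly as in the classification example under Usage tips below):

```python
from transformers import DeiTImageProcessor, ViTForImageClassification

# DeiT-trained weights plugged directly into the plain ViT classification model;
# note the DeiT-specific image processor used to prepare the inputs.
processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("facebook/deit-base-patch16-224")
```
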
This model was contributed by nielsr. The original code (written in JAX) can be found here.

Note that we converted the weights from Ross Wightman's timm library, where they had already been converted from JAX to PyTorch. Credits go to him!

Usage tips

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size, non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as a representation of the entire image, which can be used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.

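For example, at a 224x224 resolution with 16x16 patches, the encoder sees a sequence of 197 vectors: 196 patch embeddings plus the [CLS] token. A quick sanity check of that arithmetic:

```python
image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
sequence_length = num_patches + 1              # +1 for the [CLS] token
print(sequence_length)                         # 197
```
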
As the Vision Transformer expects each image to be of the same size (resolution), one can use [ViTImageProcessor] to resize (or rescale) and normalize images for the model.

Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of each checkpoint. For example, google/vit-base-patch16-224 refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the hub.

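A minimal classification sketch using such a checkpoint (the image URL is just an arbitrary example; any PIL image works):

```python
import requests
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# load an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# resize/rescale + normalize, then classify over the 1,000 ImageNet classes
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
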
The available checkpoints are either (1) pre-trained on ImageNet-21k (a collection of 14 million images and 21k classes) only, or (2) also fine-tuned on ImageNet (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).

The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it is often beneficial to use a higher resolution than pre-training (Touvron et al., 2019), (Kolesnikov et al., 2020). In order to fine-tune at higher resolution, the authors perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image.

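The ViT model classes expose an interpolate_pos_encoding argument for this. A minimal sketch, using a random tensor as a stand-in for an image processed at a higher resolution:

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# stand-in for one image processed at 384x384 instead of the 224x224
# resolution the position embeddings were pre-trained at
pixel_values = torch.randn(1, 3, 384, 384)

# interpolate_pos_encoding=True 2D-interpolates the pre-trained position
# embeddings so the model accepts the longer patch sequence
with torch.no_grad():
    outputs = model(pixel_values, interpolate_pos_encoding=True)
print(outputs.logits.shape)  # torch.Size([1, 1000])
```
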
The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed an experiment with a self-supervised pre-training objective, namely masked patch prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.

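Relatedly, the library includes a [ViTForMaskedImageModeling] head (documented below) that reconstructs raw pixel values for masked patches and can be used for this kind of self-supervised pre-training. A rough sketch with a random patch mask (the checkpoint choice and the mask here are just illustrative):

```python
import torch
from transformers import ViTForMaskedImageModeling

model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

# 224x224 image with 16x16 patches -> 196 patch positions that can be masked
num_patches = (model.config.image_size // model.config.patch_size) ** 2

pixel_values = torch.randn(1, 3, 224, 224)                      # stand-in for a processed image
bool_masked_pos = torch.randint(0, 2, (1, num_patches)).bool()  # random mask over the patches

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss, outputs.reconstruction.shape)  # reconstruction: (1, 3, 224, 224)
```
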
Resources

Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found here.

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

[ViTForImageClassification] is supported by:

- A blog post on how to Fine-Tune ViT for Image Classification with Hugging Face Transformers
- A blog post on Image Classification with Hugging Face Transformers and Keras
- A notebook on Fine-tuning for Image Classification with Hugging Face Transformers
- A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with the Hugging Face Trainer
- A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with PyTorch Lightning

⚗️ Optimization

- A blog post on how to Accelerate Vision Transformer (ViT) with Quantization using Optimum

⚡️ Inference

- A notebook on Quick demo: Vision Transformer (ViT) by Google Brain

🚀 Deploy

- A blog post on Deploying Tensorflow Vision Models in Hugging Face with TF Serving
- A blog post on Deploying Hugging Face ViT on Vertex AI
- A blog post on Deploying Hugging Face ViT on Kubernetes with TF Serving

ViTConfig

[[autodoc]] ViTConfig

ViTFeatureExtractor

[[autodoc]] ViTFeatureExtractor
    - __call__

ViTImageProcessor

[[autodoc]] ViTImageProcessor
    - preprocess

ViTModel

[[autodoc]] ViTModel
    - forward

ViTForMaskedImageModeling

[[autodoc]] ViTForMaskedImageModeling
    - forward

ViTForImageClassification

[[autodoc]] ViTForImageClassification
    - forward

TFViTModel

[[autodoc]] TFViTModel
    - call

TFViTForImageClassification

[[autodoc]] TFViTForImageClassification
    - call

FlaxViTModel

[[autodoc]] FlaxViTModel
    - __call__

FlaxViTForImageClassification

[[autodoc]] FlaxViTForImageClassification
    - __call__