|
|
|
# DeiT
|
## Overview
|
The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. The Vision Transformer (ViT) introduced in [Dosovitskiy et al., 2020](https://arxiv.org/abs/2010.11929) has shown that one can match or even outperform existing convolutional neural networks using a Transformer encoder (BERT-like). However, the ViT models introduced in that paper required training on expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) are more efficiently trained transformers for image classification, requiring far less data and far fewer computing resources compared to the original ViT models.
|
The abstract from the paper is the following: |
|
*Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.*
|
This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
|
## Usage tips
|
|
|
- Compared to ViT, DeiT models use a so-called distillation token to effectively learn from a teacher (which, in the DeiT paper, is a ResNet-like model). The distillation token is learned through backpropagation, by interacting with the class ([CLS]) and patch tokens through the self-attention layers.
- There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top of the final hidden state of the class token and not using the distillation signal, or (2) by placing both a prediction head on top of the class token and on top of the distillation token. In that case, the [CLS] prediction head is trained using regular cross-entropy between the prediction of the head and the ground-truth label, while the distillation prediction head is trained using hard distillation (cross-entropy between the prediction of the distillation head and the label predicted by the teacher). At inference time, one takes the average prediction of both heads as the final prediction. (2) is also called "fine-tuning with distillation", because one relies on a teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds to [DeiTForImageClassification] and (2) corresponds to [DeiTForImageClassificationWithTeacher] (see the inference sketch after this list).
- Note that the authors also tried soft distillation for (2) (in which case the distillation prediction head is trained using KL divergence to match the softmax output of the teacher), but hard distillation gave the best results.
- All released checkpoints were pre-trained and fine-tuned on ImageNet-1k only; no external data was used. This is in contrast with the original ViT model, which used external data like the JFT-300M dataset/ImageNet-21k for pre-training.
- The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into [ViTModel] or [ViTForImageClassification] (a loading sketch also follows below). Techniques like data augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset (while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes): [facebook/deit-tiny-patch16-224](https://huggingface.co/facebook/deit-tiny-patch16-224), [facebook/deit-small-patch16-224](https://huggingface.co/facebook/deit-small-patch16-224), [facebook/deit-base-patch16-224](https://huggingface.co/facebook/deit-base-patch16-224) and [facebook/deit-base-patch16-384](https://huggingface.co/facebook/deit-base-patch16-384). Note that one should use [DeiTImageProcessor] in order to prepare images for the model.
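The following is a minimal inference sketch for the distilled setup described above, not code from the paper's release: it assumes the distilled checkpoint `facebook/deit-base-distilled-patch16-224` is available on the Hub and uses a COCO sample image. Note that the logits of [DeiTForImageClassificationWithTeacher] are already the average of the class-head and distillation-head predictions.

```python
import torch
import requests
from PIL import Image
from transformers import DeiTImageProcessor, DeiTForImageClassificationWithTeacher

# Assumed distilled checkpoint; other deit-*-distilled-* checkpoints work the same way
checkpoint = "facebook/deit-base-distilled-patch16-224"
processor = DeiTImageProcessor.from_pretrained(checkpoint)
model = DeiTForImageClassificationWithTeacher.from_pretrained(checkpoint)

# Any RGB image works; here we download a sample image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The image processor resizes and normalizes the image into pixel_values
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # logits = average of the class head and the distillation head
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```

Since the non-distilled checkpoints listed above were released as ViT-compatible weights, loading one into [ViTForImageClassification] should work along these lines (again a sketch, assuming the checkpoint names above):

```python
from transformers import DeiTImageProcessor, ViTForImageClassification

processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("facebook/deit-base-patch16-224")
```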
|
|
|
## Resources
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DeiT. |
|
|
|
- [DeiTForImageClassification] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)
|
|
|
Besides that: |
|
|
|
- [DeiTForMaskedImageModeling] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining); a minimal usage sketch follows below.
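As a short, hedged illustration of the masked-image-modeling API (a sketch, assuming the same distilled checkpoint as above; the random mask is purely illustrative): each image patch is either kept or masked via `bool_masked_pos`, and the model returns a reconstruction loss together with the reconstructed image.

```python
import torch
import requests
from PIL import Image
from transformers import DeiTImageProcessor, DeiTForMaskedImageModeling

checkpoint = "facebook/deit-base-distilled-patch16-224"  # assumed checkpoint
processor = DeiTImageProcessor.from_pretrained(checkpoint)
model = DeiTForMaskedImageModeling.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# One boolean entry per patch; True marks a patch as masked (random here, for illustration)
num_patches = (model.config.image_size // model.config.patch_size) ** 2
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss, reconstruction = outputs.loss, outputs.reconstruction  # reconstruction: (1, 3, 224, 224)
```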
|
|
|
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. |
|
## DeiTConfig
|
[[autodoc]] DeiTConfig |
|
## DeiTFeatureExtractor

[[autodoc]] DeiTFeatureExtractor
    - __call__
|
## DeiTImageProcessor

[[autodoc]] DeiTImageProcessor
    - preprocess
|
|
|
## DeiTModel

[[autodoc]] DeiTModel
    - forward
|
## DeiTForMaskedImageModeling

[[autodoc]] DeiTForMaskedImageModeling
    - forward
|
## DeiTForImageClassification

[[autodoc]] DeiTForImageClassification
    - forward
|
## DeiTForImageClassificationWithTeacher

[[autodoc]] DeiTForImageClassificationWithTeacher
    - forward
|
|
|
## TFDeiTModel

[[autodoc]] TFDeiTModel
    - call
|
## TFDeiTForMaskedImageModeling

[[autodoc]] TFDeiTForMaskedImageModeling
    - call
|
## TFDeiTForImageClassification

[[autodoc]] TFDeiTForImageClassification
    - call
|
## TFDeiTForImageClassificationWithTeacher

[[autodoc]] TFDeiTForImageClassificationWithTeacher
    - call
|
|
|
|