Swin Transformer

Overview

The Swin Transformer was proposed in [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

The abstract from the paper is the following:

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with **S**hifted **win**dows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

Swin Transformer architecture. Taken from the original paper.

This model was contributed by [novice03](https://huggingface.co/novice03). The TensorFlow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts). The original code can be found [here](https://github.com/microsoft/Swin-Transformer).

Usage tips

Swin pads the inputs, so it supports any input height and width, as long as both are divisible by 32.

Swin can be used as a backbone. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch_size, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`; the sketch below illustrates the difference.
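
The snippet below is a minimal sketch of the backbone-style usage. The checkpoint name `microsoft/swin-tiny-patch4-window7-224` is an assumption (one public checkpoint); any Swin checkpoint should expose the same outputs.

```python
import torch
from transformers import SwinModel

# Load a pretrained Swin encoder (checkpoint name is an assumption; any
# Swin checkpoint exposes the same outputs).
model = SwinModel.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# Random pixel values stand in for a preprocessed image; height and width
# only need to be divisible by 32.
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, output_hidden_states=True)

# Sequence layout: (batch_size, sequence_length, num_channels)
print(outputs.hidden_states[-1].shape)
# Feature-map layout: (batch_size, num_channels, height, width)
print(outputs.reshaped_hidden_states[-1].shape)
```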

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Swin Transformer.

[SwinForImageClassification] is supported by this example script and notebook.

See also: Image classification task guide
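
For a quick start, here is a minimal classification sketch using the `pipeline` API. The checkpoint name is an assumption (one publicly available ImageNet-1K model), not the only option.

```python
from transformers import pipeline

# Checkpoint name is an assumption: one public ImageNet-1K Swin classifier.
classifier = pipeline("image-classification", model="microsoft/swin-tiny-patch4-window7-224")

# The pipeline accepts local paths, PIL images, or URLs.
predictions = classifier("http://images.cocodataset.org/val2017/000000039769.jpg")
print(predictions[0]["label"], predictions[0]["score"])
```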

Besides that:

[SwinForMaskedImageModeling] is supported by this example script.
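
A minimal sketch of the masked image modeling forward pass follows. The checkpoint name `microsoft/swin-base-simmim-window6-192` is an assumption (a public SimMIM-pretrained model).

```python
import torch
from transformers import SwinForMaskedImageModeling

# Checkpoint name is an assumption: a public SimMIM-pretrained Swin model.
model = SwinForMaskedImageModeling.from_pretrained("microsoft/swin-base-simmim-window6-192")

size = model.config.image_size
num_patches = (size // model.config.patch_size) ** 2

# Random pixel values stand in for a preprocessed image.
pixel_values = torch.randn(1, 3, size, size)

# Random boolean patch mask: True marks patches the model must reconstruct.
bool_masked_pos = torch.randint(0, 2, (1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss)                  # reconstruction loss on the masked patches
print(outputs.reconstruction.shape)  # (1, 3, size, size)
```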

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

SwinConfig

[[autodoc]] SwinConfig

SwinModel

[[autodoc]] SwinModel
- forward

SwinForMaskedImageModeling

[[autodoc]] SwinForMaskedImageModeling
- forward

SwinForImageClassification

[[autodoc]] SwinForImageClassification
- forward

TFSwinModel

[[autodoc]] TFSwinModel
- call

TFSwinForMaskedImageModeling

[[autodoc]] TFSwinForMaskedImageModeling
- call

TFSwinForImageClassification

[[autodoc]] TFSwinForImageClassification
- call
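
The TensorFlow classes mirror the PyTorch API. A minimal inference sketch, again assuming the public `microsoft/swin-tiny-patch4-window7-224` checkpoint (assumed to also host TensorFlow weights):

```python
import numpy as np
from transformers import TFSwinForImageClassification

# Checkpoint name is an assumption; it is expected to include TF weights.
model = TFSwinForImageClassification.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

# Random pixel values in channels-first layout, matching the PyTorch convention.
pixel_values = np.random.rand(1, 3, 224, 224).astype("float32")

outputs = model(pixel_values)
print(outputs.logits.shape)  # (1, num_labels)
```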