|
|
|
# Dilated Neighborhood Attention Transformer
|
## Overview
|
The DiNAT model was proposed in [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi.

It extends NAT with a Dilated Neighborhood Attention pattern to capture global context, and shows significant performance improvements over it.
|
The abstract from the paper is the following: |
|
*Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ) and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).*
|
|
|
*Neighborhood Attention with different dilation values. Taken from the original paper.*
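
To make the dilation pattern concrete, the following is a small PyTorch sketch (an illustration only, not NATTEN's fused implementation) of which key positions each 1D query attends to under Dilated Neighborhood Attention. With `dilation=1` it reduces to plain Neighborhood Attention; larger dilations widen the window at the same number of attended keys:

```python
# Illustrative sketch only, not NATTEN's implementation. For a 1D sequence,
# each query attends to `kernel_size` keys spaced `dilation` apart; windows
# that would fall out of bounds are shifted inward in whole dilation steps,
# so every query keeps a full neighborhood within its dilation group.
import torch

def dilated_neighborhood_indices(length: int, kernel_size: int, dilation: int) -> torch.Tensor:
    """Return a (length, kernel_size) tensor of key indices per query position."""
    assert kernel_size % 2 == 1, "kernel size is odd by convention"
    assert length >= kernel_size * dilation, "sequence too short for this window"
    half = kernel_size // 2
    offsets = torch.arange(-half, half + 1) * dilation   # e.g. [-2, 0, 2] for k=3, d=2
    idx = torch.arange(length).unsqueeze(1) + offsets    # (length, kernel_size)
    # Shift under-running windows right, in multiples of `dilation`.
    under = (-idx.min(dim=1, keepdim=True).values).clamp(min=0)
    idx = idx + (under + dilation - 1) // dilation * dilation
    # Shift over-running windows left, in multiples of `dilation`.
    over = (idx.max(dim=1, keepdim=True).values - (length - 1)).clamp(min=0)
    idx = idx - (over + dilation - 1) // dilation * dilation
    return idx

# dilation=1 is plain Neighborhood Attention; dilation=2 spans twice the range.
print(dilated_neighborhood_indices(length=8, kernel_size=3, dilation=1))
print(dilated_neighborhood_indices(length=8, kernel_size=3, dilation=2))
```

Stacking layers that alternate between dilation 1 and larger dilations is what lets DiNAT's receptive field grow exponentially with depth at no additional cost per layer.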
|
This model was contributed by Ali Hassani. The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).
|
## Usage tips
|
DiNAT can be used as a backbone. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`.
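
For example, a minimal sketch of pulling features out of the backbone (the checkpoint name `shi-labs/dinat-mini-in1k-224` is an assumption here, and NATTEN must be installed):

```python
# Minimal sketch: run a random image through DinatModel and compare the two
# hidden-state layouts. Checkpoint name is an assumption; requires NATTEN.
import torch
from transformers import DinatModel

model = DinatModel.from_pretrained("shi-labs/dinat-mini-in1k-224")
pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image

with torch.no_grad():
    outputs = model(pixel_values, output_hidden_states=True)

for hs, rhs in zip(outputs.hidden_states, outputs.reshaped_hidden_states):
    # hidden_states:          (batch, height, width, num_channels)
    # reshaped_hidden_states: (batch, num_channels, height, width)
    print(tuple(hs.shape), tuple(rhs.shape))
```

The `(batch, num_channels, height, width)` layout is the one dense-prediction heads (e.g. detection and segmentation necks) typically expect.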
|
Notes:

- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention. You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build it on your system by running `pip install natten`. Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
- Only a patch size of 4 is supported at the moment.
|
## Resources
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiNAT. |
|
|
|
- [DinatForImageClassification] is supported by this example script and notebook; a minimal inference sketch is shown below.
- See also: Image classification task guide
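
As a quick start, a minimal inference sketch (assuming the `shi-labs/dinat-mini-in1k-224` checkpoint and a working NATTEN install):

```python
# Minimal classification sketch; the checkpoint name is an assumption, and the
# image URL is the standard COCO sample used throughout the docs.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DinatForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-mini-in1k-224")
model = DinatForImageClassification.from_pretrained("shi-labs/dinat-mini-in1k-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```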
|
|
|
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. |
|
## DinatConfig

[[autodoc]] DinatConfig
|
## DinatModel

[[autodoc]] DinatModel
    - forward
|
## DinatForImageClassification

[[autodoc]] DinatForImageClassification
    - forward