These changes introduce desirable properties of convolutional neural networks (CNNs)
(\ie shift, scale, and distortion invariance) to the ViT architecture while maintaining the merits of Transformers (\ie dynamic attention,
global context, and better generalization).