ViT hybrid is a slight variant of the plain Vision Transformer.
It leverages a convolutional backbone (specifically, BiT), whose features are used as the initial "tokens" for the Transformer.
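
The idea of using backbone features as tokens can be illustrated with a minimal sketch. It assumes the backbone output is a feature map of shape [channels][height][width] and that each spatial position becomes one token. The function name `feature_map_to_tokens` and the toy values are hypothetical and are not taken from any library. The sketch covers only the reshaping step, not the BiT backbone itself or any learned projection.

```python
def feature_map_to_tokens(feature_map):
    """Flatten a [C][H][W] feature map into H*W tokens, each of length C.

    Hypothetical helper for illustration only: one token per spatial position.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    tokens = []
    # Walk the spatial grid row by row; gather all channel values at each position.
    for row in range(height):
        for col in range(width):
            tokens.append([feature_map[c][row][col] for c in range(channels)])
    return tokens


# Toy stand-in for a backbone output (assumed values): 3 channels on a 2x2 grid.
toy_features = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
tokens = feature_map_to_tokens(toy_features)
print(len(tokens), len(tokens[0]))
```

In this toy case the 2x2 grid yields four tokens of dimension three, which is the sequence a Transformer encoder would consume in place of raw image patches.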