ViT hybrid is a slight variant of the plain Vision Transformer: instead of splitting the raw image into patches, it runs the image through a convolutional backbone (specifically, BiT) and uses the resulting features as the initial "tokens" for the Transformer.
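
To make the "features as tokens" idea concrete, here is a minimal plain-Python sketch of the tokenization step only. It is not the actual ViT hybrid or BiT code: `toy_backbone` and `feature_map_to_tokens` are hypothetical names, and average pooling stands in for the learned convolutional backbone. The sketch also omits the learned projection to the Transformer's hidden size and the position embeddings.

```python
# Illustrative only: a shape-level stand-in for the hybrid tokenization step.
# toy_backbone and feature_map_to_tokens are hypothetical helper names.

def toy_backbone(image, stride=2):
    """Stand-in for a convolutional backbone such as BiT.

    `image` is a list of channels, each an H x W list of lists. Each output
    value is the mean of a stride x stride window, so the spatial size shrinks
    by `stride` while the channel count stays the same.
    """
    features = []
    for channel in image:
        height, width = len(channel), len(channel[0])
        pooled = []
        for top in range(0, height - stride + 1, stride):
            row = []
            for left in range(0, width - stride + 1, stride):
                window = [channel[top + dy][left + dx]
                          for dy in range(stride) for dx in range(stride)]
                row.append(sum(window) / len(window))
            pooled.append(row)
        features.append(pooled)
    return features


def feature_map_to_tokens(features):
    """Flatten a (C, H, W) feature map into H*W tokens of dimension C."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[features[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]


# A 3-channel 4x4 "image" with a distinct value at every position.
image = [[[float(c * 100 + y * 4 + x) for x in range(4)] for y in range(4)]
         for c in range(3)]

tokens = feature_map_to_tokens(toy_backbone(image))
print(len(tokens), len(tokens[0]))  # number of tokens, token dimension
```

Each spatial position of the backbone's output becomes one token, so the sequence length is set by the feature map's resolution and not by a fixed patch grid over the input pixels.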