ViT hybrid is a slight variant of the plain Vision Transformer: it uses a convolutional backbone (specifically, BiT)
whose output features serve as the initial "tokens" for the Transformer, in place of embeddings of raw image patches.
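
To make "features used as tokens" concrete, here is a minimal pure-Python sketch. It assumes the backbone emits a feature map with C channels on an H x W spatial grid, and turns each spatial location into one token of dimension C. The helper name `feature_map_to_tokens` and the toy values are hypothetical and only for illustration; this is not the library's implementation, and it leaves out the later steps (projection to the Transformer's hidden size, class token, position embeddings).

```python
# Hypothetical helper for illustration; not part of any library API.
def feature_map_to_tokens(feature_map):
    """Flatten a [C][H][W] feature map into H*W tokens of dimension C."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    tokens = []
    for row in range(height):
        for col in range(width):
            # One token per spatial location: its vector of channel activations.
            tokens.append([feature_map[c][row][col] for c in range(channels)])
    return tokens


# Toy stand-in for a backbone output: 3 channels on a 2x2 spatial grid.
toy_features = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
tokens = feature_map_to_tokens(toy_features)
print(len(tokens), len(tokens[0]))  # number of tokens, token dimension
```

With a tensor library the same step is usually a single reshape plus transpose. The loop form is used here only to show which values end up in which token.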