The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models.
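
To illustrate how aligned representations can support zero-shot classification, here is a minimal sketch, not the model's actual implementation. It assumes an image embedding and one text embedding per candidate label already live in the same vector space, and picks the label whose text embedding is most similar to the image. The function names, the toy 3-dimensional vectors, the prompt strings, and the temperature value are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Return the best-matching label and a softmax over all labels.

    image_emb: embedding of the image (hypothetical, precomputed).
    text_embs: dict mapping a label prompt to its text embedding.
    temperature: assumed scaling factor, not a value taken from the source.
    """
    # Similarity of the image to every candidate label prompt.
    sims = {label: cosine(image_emb, emb) for label, emb in text_embs.items()}

    # Numerically stable softmax over temperature-scaled similarities.
    max_logit = max(s / temperature for s in sims.values())
    exps = {label: math.exp(s / temperature - max_logit) for label, s in sims.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}

    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings, made up for this example; a real model would produce them.
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_emb, text_embs)
print(label)
```

No classifier is trained on the candidate labels here: adding a new class only requires embedding a new text prompt, which is what makes the procedure zero-shot.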