This model can be used to align vision and text embeddings through CLIP-like contrastive image-text
training, after which it can be used for zero-shot vision tasks such as image classification or retrieval.
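
As a rough illustration of what CLIP-like contrastive alignment and zero-shot classification involve, here is a minimal sketch in plain Python. It is not this model's implementation: the function names, the toy embeddings, and the temperature value of 0.07 are all assumptions chosen for illustration. The loss treats the matching image-text pair in each row as the correct class and averages the image-to-text and text-to-image cross-entropies.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by temperature.

    The temperature default is an assumed value for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


def zero_shot_classify(image_emb, label_text_embs):
    """Return the index of the label text embedding closest to the image."""
    scores = similarity_matrix([image_emb], label_text_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (hypothetical data).
    images = [[1.0, 0.0], [0.0, 1.0]]
    matching_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", clip_contrastive_loss(images, matching_texts))
    print("misaligned loss:", clip_contrastive_loss(images, swapped_texts))
    print("predicted label:", zero_shot_classify([0.9, 0.1], matching_texts))
```

In a real setup the embeddings would come from the vision and text encoders, and the label embeddings for zero-shot classification would be produced by encoding one text prompt per class.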