This model can be used to align vision and text embeddings through CLIP-like contrastive image-text
training, and can then be used for zero-shot vision tasks such as image classification or retrieval.
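
As a rough illustration of the contrastive training and zero-shot use described above, the sketch below computes a CLIP-style symmetric contrastive loss over a small batch of paired embeddings and then picks the closest label for one image. It is plain Python and not this model's actual code: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for the example, and a real setup would obtain the embeddings from the vision and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Return the image-by-text matrix of temperature-scaled cosine similarities.

    The temperature value is an illustrative assumption, not a setting of this model.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy losses."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def zero_shot_classify(image_emb, label_embs, temperature=0.07):
    """Return the index of the label embedding most similar to the image embedding."""
    scores = similarity_logits([image_emb], label_embs, temperature)[0]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy 2-D embeddings: image i is paired with text i.
    image_embs = [[1.0, 0.0], [0.0, 1.0]]
    text_embs = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned loss:", clip_contrastive_loss(image_embs, text_embs))
    print("mismatched loss:", clip_contrastive_loss(image_embs, text_embs[::-1]))
    print("predicted label index:", zero_shot_classify([0.9, 0.1], text_embs))
```

Training pushes matching image-text pairs toward the diagonal of the similarity matrix, so the loss is low when pairs are aligned and high when they are mismatched. Zero-shot classification reuses the same similarity scores, with the text side holding embeddings of candidate label prompts.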