This model can be used to align vision and text embeddings through CLIP-like contrastive image-text
training, after which it can be used for zero-shot vision tasks such as image classification or retrieval.
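
As a rough illustration of what CLIP-like contrastive training and zero-shot classification involve, the sketch below computes a symmetric image-text contrastive loss over a batch of paired embeddings and then picks the best-matching label for one image. It is a minimal pure-Python sketch, not this model's actual implementation: the function names (`contrastive_loss`, `zero_shot_classify`, and the helpers) are hypothetical, the embeddings are plain lists of floats standing in for encoder outputs, and the temperature of 0.07 is an assumed value.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embeds, text_embeds, temperature=0.07):
    """Return the image-by-text matrix of cosine similarities divided by temperature.

    The temperature value is an assumption made for this sketch.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def cross_entropy(logits, targets):
    """Mean cross-entropy of each row of logits against its target index."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logits)


def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss: image i and text i are the matching pair."""
    logits = similarity_logits(image_embeds, text_embeds, temperature)
    targets = list(range(len(logits)))
    transposed = [list(col) for col in zip(*logits)]  # text-to-image direction
    image_to_text = cross_entropy(logits, targets)
    text_to_image = cross_entropy(transposed, targets)
    return 0.5 * (image_to_text + text_to_image)


def zero_shot_classify(image_embed, label_text_embeds, temperature=0.07):
    """Return the index of the label text embedding most similar to the image."""
    row = similarity_logits([image_embed], label_text_embeds, temperature)[0]
    return max(range(len(row)), key=row.__getitem__)
```

Training lowers this loss by pulling matching image-text pairs together and pushing mismatched pairs apart. Once the two embedding spaces are aligned, zero-shot classification reduces to embedding one text prompt per candidate label and choosing the label whose embedding is closest to the image embedding.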