It can be instructed in natural language to predict the most relevant text snippet for a given image,
without being directly optimized for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3.
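
To make "predict the most relevant text snippet, given an image" concrete, here is a minimal sketch of one way such zero-shot ranking can work. It assumes the model maps images and text into a shared embedding space and scores candidates by cosine similarity; this section does not say how the scoring is done, so treat that as an assumption. The function names (`cosine`, `zero_shot_rank`), the `temperature` value, and the toy three-dimensional embeddings are all hypothetical and stand in for what real encoders would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_rank(image_emb, text_embs, temperature=0.01):
    """Return a probability for each candidate text, given one image.

    Assumption: relevance is cosine similarity in a shared embedding
    space, turned into probabilities with a temperature-scaled softmax.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings; a real model would compute these from the
# image pixels and from the caption strings.
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_rank(image_emb, text_embs)
best = captions[max(range(len(probs)), key=probs.__getitem__)]
print(best)
```

No caption-specific training happens here: the candidate labels are supplied as plain text at prediction time, which is what makes the setup zero-shot.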