It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.
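To make the zero-shot idea concrete, here is a minimal sketch of the ranking step only. It assumes the image and the candidate captions have already been embedded by CLIP's image and text encoders. The toy 3-dimensional vectors, the caption list, and the `logit_scale` value of 100 are illustrative assumptions, not outputs of a real model. Each caption is scored by the cosine similarity between its embedding and the image embedding, and a softmax turns the scores into a distribution over captions.

```python
import math


def normalize(v):
    # Scale a vector to unit L2 norm so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_rank(image_emb, text_embs, logit_scale=100.0):
    # logit_scale is an assumed temperature; CLIP learns its own value.
    img = normalize(image_emb)
    logits = []
    for t in text_embs:
        t = normalize(t)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, t)))
    # Numerically stable softmax over the candidate captions.
    m = max(logits)
    exps = [math.exp(score - m) for score in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy embeddings standing in for real encoder outputs.
image_embedding = [0.9, 0.1, 0.0]
captions = ["a photo of a cat", "a photo of a dog", "a diagram"]
caption_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_rank(image_embedding, caption_embeddings)
best = captions[probs.index(max(probs))]
print(best)  # prints "a photo of a cat"
```

Because the "classes" are just natural-language captions, swapping in a different task only means writing different captions; nothing about the scoring step is retrained.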