We demonstrate that the simple pre-training task of predicting which caption goes
with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
million (image, text) pairs collected from the internet.
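
To make the pre-training task concrete, the sketch below frames "predicting which caption goes with which image" as a matching problem over a small batch. It is an illustration under stated assumptions, not a description of the actual training setup: the sentence above does not specify a loss, so the symmetric cross-entropy over cosine similarities, the `temperature` value of 0.07, the function names, and the toy embeddings are all hypothetical choices made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every caption in the batch."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)


def matching_loss(image_embs, text_embs, temperature=0.07):
    """Assumed symmetric loss: right caption per image, right image per caption."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


def predicted_captions(image_embs, text_embs):
    """Index of the most similar caption for each image."""
    sims = similarity_matrix(image_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Toy batch: pair k consists of image_embs[k] and text_embs[k].
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

print(predicted_captions(image_embs, aligned_text))
print(matching_loss(image_embs, aligned_text) < matching_loss(image_embs, swapped_text))
```

In this toy batch the loss is low when each image embedding lies closest to its own caption embedding and high when the captions are swapped, which is the signal an encoder pair would be trained to minimize under this assumed objective.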