We demonstrate that the simple pre-training task of predicting which caption goes
with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
million (image, text) pairs collected from the internet.
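
The task of predicting which caption goes with which image can be illustrated with a small sketch. The text above does not specify a loss, so the code below assumes one common formulation: a symmetric cross-entropy over the cosine similarities between every image and every caption in a batch, where the matching pair sits on the diagonal. The function names, the `temperature` value of 0.07, and the toy two-dimensional embeddings are all illustrative assumptions, not details taken from the source.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def matching_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy for an (image, text) batch.

    Pair k is assumed to be (image_embs[k], text_embs[k]); the temperature
    is a hypothetical default chosen for illustration.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # For each image, pick the right caption (rows of the matrix).
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # For each caption, pick the right image (columns of the matrix).
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return (image_to_text + text_to_image) / 2


# Toy batch: two images and two captions embedded in 2-D.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = matching_loss(images, matched_captions)
loss_swapped = matching_loss(images, swapped_captions)
```

In this toy batch the loss is close to zero when each caption is aligned with its image and large when the captions are swapped, which is the signal a model trained on this objective would follow. In practice the embeddings would come from learned image and text encoders rather than hand-written vectors.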