Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations of the text modality, but they differ in their pre-training objectives for the image modality.
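To make the text-side objective concrete, here is a minimal sketch of masked language modeling: a fraction of the text tokens is replaced by a mask token, a bidirectional encoder predicts the original tokens, and the loss is taken only over the masked positions. The `text_encoder` module, `mask_token_id`, and the 15% masking rate are illustrative assumptions, not the implementation of any particular model.

```py
import torch
import torch.nn.functional as F

def masked_lm_loss(token_ids, text_encoder, mask_token_id, mask_prob=0.15):
    """Sketch of the masked language modeling objective on the text modality.

    `text_encoder` is assumed to map token ids of shape (batch, seq_len)
    to per-token logits of shape (batch, seq_len, vocab_size).
    """
    # Randomly choose which token positions to corrupt
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id

    # The bidirectional encoder sees the corrupted sequence and predicts every position
    logits = text_encoder(corrupted)  # (batch, seq_len, vocab_size)

    # The loss is computed only on the masked positions
    return F.cross_entropy(logits[mask], token_ids[mask])
```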