Ahmadzei's picture
added 3 more tables for large emb model
5fa1a76
VisualBERT predicts the masked text based on the unmasked text and the visual embeddings, and it also has to predict whether the text is aligned with the image.