We thus propose the LXMERT (Learning Cross-Modality
Encoder Representations from Transformers) framework to learn these vision-and-language connections.