We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections.