Experiments on four vision-and-language tasks, including VQA, VCR, NLVR2,
and Flickr30K, show that VisualBERT outperforms or rivals state-of-the-art models while being significantly
simpler.