VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an
associated input image with self-attention.
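
To make the alignment idea concrete, here is a minimal sketch of self-attention over a joint sequence of text and image-region embeddings. It is an illustration under stated assumptions, not VisualBERT's actual implementation: the embedding values are made up, there is a single attention head, and the query/key/value projections are taken to be the identity. The real model uses learned multi-head projections together with position and segment embeddings. The names `self_attention`, `text_embeddings`, `region_embeddings`, and `aligned` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so each vector serves as its
    own query, key, and value. A trained model learns these projections.
    """
    dim = len(sequence[0])
    all_weights, outputs = [], []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(weights, sequence)) for i in range(dim)]
        )
    return outputs, all_weights


# Hypothetical toy embeddings: 2 text tokens and 3 image regions, 4 dims each.
text_embeddings = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
region_embeddings = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.8, 0.2, 0.0],
]

# Text and regions are concatenated into one sequence, so every text token
# can attend to every image region (and vice versa) in the same layer.
sequence = text_embeddings + region_embeddings
outputs, weights = self_attention(sequence)

# For each text token, find the image region that receives the most attention.
n_text = len(text_embeddings)
aligned = []
for t in range(n_text):
    region_weights = weights[t][n_text:]
    aligned.append(max(range(len(region_weights)), key=region_weights.__getitem__))

print(aligned)
```

Because both modalities share one sequence, no separate alignment module is needed in this sketch: the attention weights from a text token to the region positions already act as a soft text-to-region alignment.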