The text input is concatenated in front of the visual embeddings in the embedding
layer, and is expected to be bounded by a [CLS] and a [SEP] token, as in BERT.
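
The ordering described above can be illustrated with a minimal sketch in plain Python. It is not the model's actual implementation: `wrap_text_tokens`, `toy_embed`, and `build_input_sequence` are hypothetical helpers. The toy embedding stands in for a learned lookup table, and the visual embeddings are assumed to already have the same dimension as the text embeddings.

```python
from typing import List, Tuple

Vector = List[float]


def wrap_text_tokens(tokens: List[str]) -> List[str]:
    """Bound the text tokens with [CLS] and [SEP], as in BERT."""
    return ["[CLS]"] + tokens + ["[SEP]"]


def toy_embed(token: str, dim: int) -> Vector:
    """Hypothetical stand-in for a learned embedding lookup (deterministic toy values)."""
    return [float((len(token) + i) % 7) for i in range(dim)]


def build_input_sequence(
    text_tokens: List[str], visual_embeds: List[Vector], dim: int
) -> Tuple[List[str], List[Vector]]:
    """Place the text embeddings in front of the visual embeddings."""
    wrapped = wrap_text_tokens(text_tokens)
    text_embeds = [toy_embed(tok, dim) for tok in wrapped]
    # Assumption: visual embeddings were already projected to the text dimension.
    for vec in visual_embeds:
        if len(vec) != dim:
            raise ValueError("visual embedding dimension must match text dimension")
    return wrapped, text_embeds + list(visual_embeds)


# Two text tokens plus three toy visual-region embeddings of dimension 4.
visual = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
wrapped, sequence = build_input_sequence(["a", "dog"], visual, dim=4)
print(wrapped)
print(len(sequence))
```

The combined sequence holds the four text positions ([CLS], the two text tokens, [SEP]) first, followed by the three visual positions.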