The text input is concatenated in front of the visual embeddings in the embedding
layer and is expected to be bounded by [CLS] and [SEP] tokens, as in BERT.
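
The ordering described above can be illustrated with a minimal pure-Python sketch. It is not the model's actual embedding layer: `wrap_text_tokens`, `concat_text_and_visual`, and `toy_embed` are hypothetical helpers, the hidden size of 4 is arbitrary, and the "embeddings" are placeholder values rather than learned vectors. It only shows that the text tokens are wrapped in [CLS] and [SEP] and that their embeddings come before the visual ones along the sequence axis.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def wrap_text_tokens(tokens: List[str]) -> List[str]:
    """Bound the text tokens with [CLS] and [SEP], as in BERT."""
    return [CLS] + list(tokens) + [SEP]


def concat_text_and_visual(
    text_embeds: List[List[float]], visual_embeds: List[List[float]]
) -> List[List[float]]:
    """Place the text embeddings in front of the visual embeddings."""
    if text_embeds and visual_embeds and len(text_embeds[0]) != len(visual_embeds[0]):
        raise ValueError("text and visual embeddings must share the hidden size")
    return list(text_embeds) + list(visual_embeds)


def toy_embed(token: str, hidden_size: int = 4) -> List[float]:
    # Hypothetical stand-in for a learned embedding lookup (assumption).
    return [float(len(token))] * hidden_size


tokens = wrap_text_tokens(["a", "dog"])
text_embeds = [toy_embed(t) for t in tokens]
# Three placeholder image-region features, already projected to the hidden size.
visual_embeds = [[0.0] * 4 for _ in range(3)]
sequence = concat_text_and_visual(text_embeds, visual_embeds)
```

In this sketch the combined sequence has one entry per text token (including the two special tokens) followed by one entry per visual region; a real model would typically also add position and segment information before passing the sequence to the encoder.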