LXMERT is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and a third to fuse the two), pretrained with a combination of objectives: masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual question answering.
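
Below is a minimal sketch of running this three-encoder architecture through the Transformers library, assuming the description refers to LXMERT and the `unc-nlp/lxmert-base-uncased` checkpoint. The vision encoder consumes pre-extracted region-of-interest (ROI) features rather than raw pixels, so the `visual_feats` and `visual_pos` tensors here are random placeholders standing in for the output of an upstream object detector such as Faster R-CNN.

```python
import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Language input for the language encoder.
inputs = tokenizer("What color is the cat?", return_tensors="pt")

# Dummy ROI features standing in for a real detector's output:
# 36 regions, each with a 2048-d feature vector and a 4-d normalized box.
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

outputs = model(
    **inputs,
    visual_feats=visual_feats,
    visual_pos=visual_pos,
)

# One contextualized sequence per modality, plus a pooled summary
# produced after the cross-modality (fusion) encoder.
print(outputs.language_output.shape)  # (1, seq_len, 768)
print(outputs.vision_output.shape)    # (1, 36, 768)
print(outputs.pooled_output.shape)    # (1, 768), fused cross-modal summary
```

Because the visual backbone is not part of the checkpoint, the quality of `visual_feats` in practice depends entirely on the upstream detector used to extract them.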