It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and one to fuse both modalities) pretrained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual question answering objectives.
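The two-stream-plus-fusion layout described above can be sketched as follows. This is an illustrative simplification, not the model's actual implementation: module names, dimensions, and the fusion mechanism (here, self-attention over the concatenated streams rather than dedicated cross-attention layers) are all assumptions for the sake of a compact example.

```python
# Hypothetical sketch of a two-stream transformer with a fusion encoder.
# All names and hyperparameters are illustrative, not the model's own.
import torch
import torch.nn as nn

def make_encoder(d_model=64, nhead=4, num_layers=2):
    layer = nn.TransformerEncoderLayer(
        d_model, nhead, dim_feedforward=128, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers)

class TwoStreamFusionModel(nn.Module):
    def __init__(self, vocab_size=1000, roi_dim=2048, d_model=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)  # language tokens
        self.roi_proj = nn.Linear(roi_dim, d_model)          # ROI features -> model dim
        self.text_encoder = make_encoder(d_model)            # language stream
        self.vision_encoder = make_encoder(d_model)          # vision stream
        self.fusion_encoder = make_encoder(d_model)          # cross-modal fusion

    def forward(self, token_ids, roi_feats):
        t = self.text_encoder(self.text_embed(token_ids))
        v = self.vision_encoder(self.roi_proj(roi_feats))
        # Concatenate both streams so the fusion encoder attends across modalities.
        return self.fusion_encoder(torch.cat([t, v], dim=1))

model = TwoStreamFusionModel()
tokens = torch.randint(0, 1000, (2, 12))  # batch of 2, 12 text tokens each
rois = torch.randn(2, 36, 2048)           # 36 detected-region features per image
fused = model(tokens, rois)
print(fused.shape)  # torch.Size([2, 48, 64]) -- 12 text + 36 visual positions
```

Each of the listed pretraining objectives would then attach a small head to the fused (or per-stream) representations, e.g. a vocabulary classifier over masked text positions for masked language modeling, or a regression head over masked region positions for ROI-feature regression.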