It is a series of bidirectional transformer encoders
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives.
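
One way to picture "a combination of" objectives is as a single training loss built from the per-objective losses. The sketch below is only an illustration of that idea, not the model's actual training code: the dictionary keys, the function name `combined_pretraining_loss`, and the default equal weighting are all assumptions introduced here.

```python
# Objective names follow the description above; the snake_case labels
# themselves are hypothetical and chosen only for this sketch.
PRETRAINING_OBJECTIVES = (
    "masked_language_modeling",
    "visual_language_text_alignment",
    "roi_feature_regression",
    "masked_visual_attribute_modeling",
    "masked_visual_object_modeling",
    "visual_question_answering",
)


def combined_pretraining_loss(losses, weights=None):
    """Return a weighted sum of the per-objective losses.

    Assumption: objectives are combined linearly, with weight 1.0 for
    any objective that has no explicit weight.
    """
    missing = [name for name in PRETRAINING_OBJECTIVES if name not in losses]
    if missing:
        raise KeyError(f"missing losses for: {missing}")
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * losses[name]
        for name in PRETRAINING_OBJECTIVES
    )
```

Calling `combined_pretraining_loss` with one scalar loss per objective yields the single value that would be minimized during pretraining; passing `weights` lets one objective count more than the others.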