It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and one to fuse both modalities) pretrained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual question answering objectives.
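The two-stream-plus-fusion layout described above can be sketched as follows. This is an illustrative simplification, not the model's actual implementation: module names, dimensions, and the fusion mechanism (here, self-attention over the concatenated streams rather than dedicated cross-attention layers) are all assumptions for the sake of a compact example.

```python
# Hypothetical sketch of a two-stream transformer with a fusion encoder.
# All names and hyperparameters are illustrative, not the model's own.
import torch
import torch.nn as nn

def make_encoder(d_model=64, nhead=4, num_layers=2):
    layer = nn.TransformerEncoderLayer(
        d_model, nhead, dim_feedforward=128, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers)

class TwoStreamFusionModel(nn.Module):
    def __init__(self, vocab_size=1000, roi_dim=2048, d_model=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)  # language tokens
        self.roi_proj = nn.Linear(roi_dim, d_model)          # ROI features -> model dim
        self.text_encoder = make_encoder(d_model)            # language stream
        self.vision_encoder = make_encoder(d_model)          # vision stream
        self.fusion_encoder = make_encoder(d_model)          # cross-modal fusion

    def forward(self, token_ids, roi_feats):
        t = self.text_encoder(self.text_embed(token_ids))
        v = self.vision_encoder(self.roi_proj(roi_feats))
        # Concatenate both streams so the fusion encoder attends across modalities.
        return self.fusion_encoder(torch.cat([t, v], dim=1))

model = TwoStreamFusionModel()
tokens = torch.randint(0, 1000, (2, 12))  # batch of 2, 12 text tokens each
rois = torch.randn(2, 36, 2048)           # 36 detected-region features per image
fused = model(tokens, rois)
print(fused.shape)  # torch.Size([2, 48, 64]) -- 12 text + 36 visual positions
```

Each of the listed pretraining objectives would then attach a small head to the fused (or per-stream) representations, e.g. a vocabulary classifier over masked text positions for masked language modeling, or a regression head over masked region positions for ROI-feature regression.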