Next, to endow our model with the ability to connect vision and language semantics, we pre-train it on large amounts of image-and-sentence pairs via five diverse, representative pretraining tasks: masked language modeling, masked object prediction (both feature regression and label classification), cross-modality matching, and image question answering.
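For concreteness, the sketch below illustrates one way these five objectives could be realized: each task gets its own prediction head over shared cross-modal hidden states, and the per-task losses are summed into a single pretraining loss. All dimensions, head names, target formats, and the unweighted sum are illustrative assumptions, not the configuration used here; masking logic and data loading are omitted.

```python
import torch
import torch.nn as nn

# Illustrative sizes -- assumptions, not values from this work.
HIDDEN = 768        # cross-modal hidden size
VOCAB = 30522       # word-piece vocabulary size
OBJ_LABELS = 1600   # detected-object label set size
FEAT_DIM = 2048     # RoI visual feature dimension
NUM_ANSWERS = 3129  # answer vocabulary for image QA

class PretrainingHeads(nn.Module):
    """One prediction head per pretraining task, all reading the
    backbone's cross-modal hidden states (a hypothetical interface)."""
    def __init__(self):
        super().__init__()
        self.mlm_head = nn.Linear(HIDDEN, VOCAB)         # masked language modeling
        self.feat_head = nn.Linear(HIDDEN, FEAT_DIM)     # masked object: feature regression
        self.label_head = nn.Linear(HIDDEN, OBJ_LABELS)  # masked object: label classification
        self.match_head = nn.Linear(HIDDEN, 2)           # cross-modality matching (aligned or not)
        self.qa_head = nn.Linear(HIDDEN, NUM_ANSWERS)    # image question answering

def pretraining_loss(heads, lang_h, vis_h, pooled,
                     mlm_tgt, feat_tgt, label_tgt, match_tgt, qa_tgt):
    """Sum the five task losses. lang_h: (B, T, HIDDEN) word states;
    vis_h: (B, O, HIDDEN) object states; pooled: (B, HIDDEN) sentence-
    image summary. Which positions are masked is handled upstream."""
    ce = nn.CrossEntropyLoss()
    loss = ce(heads.mlm_head(lang_h).flatten(0, 1), mlm_tgt.flatten())
    loss = loss + nn.functional.smooth_l1_loss(heads.feat_head(vis_h), feat_tgt)
    loss = loss + ce(heads.label_head(vis_h).flatten(0, 1), label_tgt.flatten())
    loss = loss + ce(heads.match_head(pooled), match_tgt)
    loss = loss + ce(heads.qa_head(pooled), qa_tgt)
    return loss
```

Summing the terms this way lets every image-and-sentence pair update the shared backbone through whichever tasks apply to it; in practice the QA term would only be computed for pairs that actually carry a question-answer annotation.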