VisualBERT is a multi-modal vision and language model.