We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language.