We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model.