ViT, BEiT, DeiT) and any pretrained text autoencoding model as the text encoder (e.g.