Specifically, each image has two views in our pre-training, i.e., image
patches (such as 16x16 pixels) and visual tokens (i.e., discrete tokens).
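
The two views can be illustrated with a minimal NumPy sketch. It is not the actual pre-training code: the 224x224 input size, the function names `image_to_patches` and `toy_visual_tokens`, and the random codebook are all assumptions made for illustration. In particular, the nearest-codebook lookup is only a stand-in for a learned image tokenizer, which the text above does not describe.

```python
import numpy as np


def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def toy_visual_tokens(patches, codebook):
    """Map each patch to the index of its nearest codebook row.

    Hypothetical stand-in for a learned tokenizer: it only shows that the
    second view is one discrete id per patch position.
    """
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))              # assumed input size
patches = image_to_patches(image)              # view 1: image patches
codebook = rng.random((8, patches.shape[1]))   # random toy codebook (assumption)
tokens = toy_visual_tokens(patches, codebook)  # view 2: discrete tokens

print(patches.shape)  # (196, 768)
print(tokens.shape)   # (196,)
```

With these assumed sizes, both views cover the same 14x14 grid of positions: the first holds 768 raw pixel values per position, the second a single integer id.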