Specifically, each image has two views in our pre-training, i.e., image
patches (such as 16x16 pixels) and visual tokens (i.e., discrete tokens).
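
The two views can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows whose sides are multiples of the patch size. The helper names `patchify` and `toy_tokenize` are hypothetical, and the mean-intensity bucketing is only a stand-in for a learned image tokenizer, not the tokenizer used in this work.

```python
# Sketch only: `patchify` and `toy_tokenize` are hypothetical helpers.
# The bucketing rule below is an illustrative assumption, not a learned tokenizer.

def patchify(image, patch_size=16):
    """Split an H x W image (list of rows) into non-overlapping square
    patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def toy_tokenize(patches, vocab_size=8, max_value=255):
    """Map each patch to one discrete id by bucketing its mean intensity."""
    tokens = []
    for patch in patches:
        pixels = [p for row in patch for p in row]
        mean = sum(pixels) / len(pixels)
        bucket = int(mean * vocab_size / (max_value + 1))
        tokens.append(min(bucket, vocab_size - 1))
    return tokens


# A 32x32 image yields a 2x2 grid of 16x16 patches, one token per patch.
image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
patches = patchify(image)        # view 1: raw pixel patches
tokens = toy_tokenize(patches)   # view 2: one discrete id per patch
```

Both views index the same grid positions, so the i-th token describes the i-th patch.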