First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates | |
only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask | |
tokens. |