The core idea is to predict latent representations of the full input data based on a
masked view of the input in a self-distillation setup using a standard Transformer architecture.
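
As a rough illustration of this objective, the sketch below computes a regression loss between a student's predictions from a masked view and a teacher's latent representations of the full input. Several things here are assumptions rather than details from the text: the teacher is taken to be an exponential moving average of the student (a common self-distillation choice), masking is done by zeroing tokens, and `encode` is a hypothetical toy function standing in for the Transformer. The names `encode`, `masked_latent_loss`, and `ema_update` are illustrative only.

```python
# Toy sketch of masked latent prediction with self-distillation.
# ASSUMPTION: `encode` is a hypothetical stand-in, NOT a Transformer.


def encode(weights, tokens):
    """Toy contextual encoder: scale (token + mean over all tokens) per feature."""
    dim = len(weights)
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[weights[d] * (tok[d] + mean[d]) for d in range(dim)] for tok in tokens]


def masked_latent_loss(student_w, teacher_w, tokens, mask):
    """Mean squared error against teacher latents, at masked positions only."""
    # The teacher encodes the full, unmasked input to produce the targets.
    targets = encode(teacher_w, tokens)
    # ASSUMPTION: masking replaces a token with zeros.
    masked = [[0.0] * len(tok) if m else list(tok) for tok, m in zip(tokens, mask)]
    # The student only sees the masked view.
    preds = encode(student_w, masked)
    errors = [
        (p - t) ** 2
        for pred, tgt, m in zip(preds, targets, mask)
        if m
        for p, t in zip(pred, tgt)
    ]
    return sum(errors) / len(errors) if errors else 0.0


def ema_update(teacher_w, student_w, decay=0.99):
    """ASSUMPTION: teacher weights follow the student as a moving average."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_w, student_w)]


tokens = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, two features each
mask = [True, False]                # hide the first token from the student
student, teacher = [0.5, 0.5], [1.0, 1.0]

loss = masked_latent_loss(student, teacher, tokens, mask)
teacher = ema_update(teacher, student)  # would follow each student gradient step
print(loss)
```

In a real setup the student would be trained by gradient descent on this loss; the sketch omits that step and only shows how the targets, the masked view, and the teacher update relate to each other.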