Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which | |
are local in nature, data2vec predicts contextualized latent representations that contain information from | |
the entire input. |
Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which | |
are local in nature, data2vec predicts contextualized latent representations that contain information from | |
the entire input. |