|
|
|
Data2Vec |
|
Overview |
|
The Data2Vec model was proposed in data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu and Michael Auli. |
|
Data2Vec proposes a unified framework for self-supervised learning across different data modalities - text, audio and images. |
|
Importantly, predicted targets for pre-training are contextualized latent representations of the inputs, rather than modality-specific, context-independent targets. |
|
The abstract from the paper is the following: |
|
While the general idea of self-supervised learning is identical across modalities, the actual algorithms and |
|
objectives differ widely because they were developed with a single modality in mind. To get us closer to general |
|
self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, |
|
NLP or computer vision. The core idea is to predict latent representations of the full input data based on a |
|
masked view of the input in a selfdistillation setup using a standard Transformer architecture. |
|
Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which |
|
are local in nature, data2vec predicts contextualized latent representations that contain information from |
|
the entire input. Experiments on the major benchmarks of speech recognition, image classification, and |
|
natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches. |
|
Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/data2vec. |
|
This model was contributed by edugp and patrickvonplaten. |
|
sayakpaul and Rocketknight1 contributed Data2Vec for vision in TensorFlow. |
|
The original code (for NLP and Speech) can be found here. |
|
The original code for vision can be found here. |
|
Usage tips |
|
|
|
Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method. |
|
For Data2VecAudio, preprocessing is identical to [Wav2Vec2Model], including feature extraction |
|
For Data2VecText, preprocessing is identical to [RobertaModel], including tokenization. |
|
For Data2VecVision, preprocessing is identical to [BeitModel], including feature extraction. |
|
|
|
Resources |
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Data2Vec. |
|
|
|
[Data2VecVisionForImageClassification] is supported by this example script and notebook. |
|
To fine-tune [TFData2VecVisionForImageClassification] on a custom dataset, see this notebook. |
|
|
|
Data2VecText documentation resources |
|
- Text classification task guide |
|
- Token classification task guide |
|
- Question answering task guide |
|
- Causal language modeling task guide |
|
- Masked language modeling task guide |
|
- Multiple choice task guide |
|
Data2VecAudio documentation resources |
|
- Audio classification task guide |
|
- Automatic speech recognition task guide |
|
Data2VecVision documentation resources |
|
- Image classification |
|
- Semantic segmentation |
|
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. |
|
Data2VecTextConfig |
|
[[autodoc]] Data2VecTextConfig |
|
Data2VecAudioConfig |
|
[[autodoc]] Data2VecAudioConfig |
|
Data2VecVisionConfig |
|
[[autodoc]] Data2VecVisionConfig |
|
|
|
Data2VecAudioModel |
|
[[autodoc]] Data2VecAudioModel |
|
- forward |
|
Data2VecAudioForAudioFrameClassification |
|
[[autodoc]] Data2VecAudioForAudioFrameClassification |
|
- forward |
|
Data2VecAudioForCTC |
|
[[autodoc]] Data2VecAudioForCTC |
|
- forward |
|
Data2VecAudioForSequenceClassification |
|
[[autodoc]] Data2VecAudioForSequenceClassification |
|
- forward |
|
Data2VecAudioForXVector |
|
[[autodoc]] Data2VecAudioForXVector |
|
- forward |
|
Data2VecTextModel |
|
[[autodoc]] Data2VecTextModel |
|
- forward |
|
Data2VecTextForCausalLM |
|
[[autodoc]] Data2VecTextForCausalLM |
|
- forward |
|
Data2VecTextForMaskedLM |
|
[[autodoc]] Data2VecTextForMaskedLM |
|
- forward |
|
Data2VecTextForSequenceClassification |
|
[[autodoc]] Data2VecTextForSequenceClassification |
|
- forward |
|
Data2VecTextForMultipleChoice |
|
[[autodoc]] Data2VecTextForMultipleChoice |
|
- forward |
|
Data2VecTextForTokenClassification |
|
[[autodoc]] Data2VecTextForTokenClassification |
|
- forward |
|
Data2VecTextForQuestionAnswering |
|
[[autodoc]] Data2VecTextForQuestionAnswering |
|
- forward |
|
Data2VecVisionModel |
|
[[autodoc]] Data2VecVisionModel |
|
- forward |
|
Data2VecVisionForImageClassification |
|
[[autodoc]] Data2VecVisionForImageClassification |
|
- forward |
|
Data2VecVisionForSemanticSegmentation |
|
[[autodoc]] Data2VecVisionForSemanticSegmentation |
|
- forward |
|
|
|
TFData2VecVisionModel |
|
[[autodoc]] TFData2VecVisionModel |
|
- call |
|
TFData2VecVisionForImageClassification |
|
[[autodoc]] TFData2VecVisionForImageClassification |
|
- call |
|
TFData2VecVisionForSemanticSegmentation |
|
[[autodoc]] TFData2VecVisionForSemanticSegmentation |
|
- call |
|
|
|
|