XLSR-Wav2Vec2 |
|
Overview |
|
The XLSR-Wav2Vec2 model was proposed in Unsupervised Cross-Lingual Representation Learning For Speech Recognition by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
|
The abstract from the paper is the following: |
|
This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to a comparable system. Our approach enables a single multilingual speech recognition model which is competitive to strong individual models. Analysis shows that the latent discrete speech representations are shared across languages with increased sharing for related languages. We hope to catalyze research in low-resource speech understanding by releasing XLSR-53, a large model pretrained in 53 languages.
|
The original code can be found here. |
|
Usage tips |
|
|
|
XLSR-Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. |
|
The XLSR-Wav2Vec2 model was trained using connectionist temporal classification (CTC), so the model output has to be decoded using [Wav2Vec2CTCTokenizer], as in the sketch below.
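Putting these two tips together, a minimal sketch of loading a fine-tuned checkpoint and decoding its CTC output could look as follows. The checkpoint name is illustrative; any XLSR-Wav2Vec2 model fine-tuned with a CTC head works the same way:

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative checkpoint name: substitute any XLSR-Wav2Vec2 model fine-tuned for CTC.
model_id = "facebook/wav2vec2-large-xlsr-53-german"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# The model consumes the raw waveform as a float array sampled at 16 kHz;
# one second of silence stands in for real audio here.
waveform = np.zeros(16000, dtype=np.float32)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame, then let the
# tokenizer collapse repeated tokens and blanks into text.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```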
|
|
|
XLSR-Wav2Vec2's architecture is based on the Wav2Vec2 model, so one can refer to Wav2Vec2's documentation page. |
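Because the architecture is shared, the pretrained multilingual checkpoint loads directly into the Wav2Vec2 classes. A short sketch, assuming the facebook/wav2vec2-large-xlsr-53 checkpoint ships a feature extractor config:

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# The pretrained-only XLSR-53 checkpoint has no CTC head, so we load it with
# Wav2Vec2Model to extract contextualized speech representations instead.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

waveform = np.zeros(16000, dtype=np.float32)  # one second of 16 kHz audio as a placeholder
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(inputs.input_values).last_hidden_state

print(hidden_states.shape)  # (batch, frames, hidden_size)
```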
|
|