Speech2Text

Overview

The Speech2Text model was proposed in fairseq S2T: Fast Speech-to-Text Modeling with fairseq by Changhan Wang, Yun
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino. It is a transformer-based seq2seq (encoder-decoder) model
designed for end-to-end Automatic Speech Recognition (ASR) and Speech Translation (ST). It uses a convolutional
downsampler to reduce the length of speech inputs by three quarters before they are fed into the encoder. The model is
trained with standard autoregressive cross-entropy loss and generates the transcripts/translations autoregressively.
Speech2Text has been fine-tuned on several datasets for ASR and ST: LibriSpeech, CoVoST 2, and MuST-C.

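
The length reduction performed by the convolutional downsampler can be checked with a little arithmetic. The sketch below assumes two stacked 1D convolutions with stride 2, kernel size 5, and padding of `kernel_size // 2`. These values are assumptions chosen for illustration, not settings stated in this document, and the helper names `conv_output_length` and `downsampled_length` are hypothetical.

```python
def conv_output_length(length: int, kernel_size: int, stride: int, padding: int) -> int:
    """Output length of a 1D convolution (dilation 1) for a given input length."""
    return (length + 2 * padding - kernel_size) // stride + 1


def downsampled_length(
    num_frames: int, num_layers: int = 2, kernel_size: int = 5, stride: int = 2
) -> int:
    """Length after `num_layers` strided convolutions.

    Assumed layout (illustrative): every layer pads by kernel_size // 2
    and uses the same stride.
    """
    length = num_frames
    for _ in range(num_layers):
        length = conv_output_length(length, kernel_size, stride, padding=kernel_size // 2)
    return length


if __name__ == "__main__":
    for frames in (584, 1000, 3000):
        out = downsampled_length(frames)
        print(f"{frames} input frames -> {out} encoder positions")
```

With two stride-2 layers, roughly one position in four survives, which corresponds to the three-quarter reduction described above.
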
This model was contributed by valhalla. The original code can be found here.

Inference

Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech
signal. It is a transformer-based seq2seq model, so the transcripts/translations are generated autoregressively. The
`generate()` method can be used for inference.

The [`Speech2TextFeatureExtractor`] class is responsible for extracting the log-mel filter-bank features. The
[`Speech2TextProcessor`] wraps [`Speech2TextFeatureExtractor`] and [`Speech2TextTokenizer`] into a single instance to
both extract the input features and decode the predicted token ids.

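
To get a feel for the shape of the feature tensor the model consumes, the sketch below estimates how many filter-bank frames a waveform yields. It assumes 16 kHz audio, a 25 ms analysis window, a 10 ms frame shift, 80 mel bins, and no padding at the signal edges. These are common filter-bank settings used here for illustration rather than values stated in this document, and `num_fbank_frames` and `feature_shape` are hypothetical helpers.

```python
def num_fbank_frames(
    num_samples: int,
    sampling_rate: int = 16_000,
    frame_length_ms: float = 25.0,
    frame_shift_ms: float = 10.0,
) -> int:
    """Number of full analysis windows that fit in the signal (no edge padding)."""
    window = int(sampling_rate * frame_length_ms / 1000)  # samples per window
    shift = int(sampling_rate * frame_shift_ms / 1000)    # samples between window starts
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def feature_shape(num_samples: int, num_mel_bins: int = 80) -> tuple:
    """(frames, mel bins) for one utterance under the assumed settings."""
    return (num_fbank_frames(num_samples), num_mel_bins)


if __name__ == "__main__":
    one_second = 16_000  # samples in one second of 16 kHz audio
    print(feature_shape(one_second))
```

Under these assumptions, each utterance becomes a two-dimensional array of frames by mel bins, and the processor stacks such arrays into the float tensor passed to the model.
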
The feature extractor depends on `torchaudio` and the tokenizer depends on `sentencepiece`, so be sure to install
those packages before running the examples. You can either install them as extra speech dependencies with
`pip install transformers"[speech, sentencepiece]"` or install the packages separately with
`pip install torchaudio sentencepiece`. `torchaudio` also requires the development version of the libsndfile package,
which can be installed via a system package manager. On Ubuntu it can be installed as follows:
`apt install libsndfile1-dev`

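
Before running the examples, it can help to confirm that both packages are importable. The following is a minimal sketch using only the standard library. The `missing_packages` helper is hypothetical and only checks that a module can be found, not that its native dependencies (such as libsndfile) are present.

```python
from importlib.util import find_spec

# Packages named in the paragraph above
REQUIRED = ("torchaudio", "sentencepiece")


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be located on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Install them with: pip install " + " ".join(missing))
    else:
        print("torchaudio and sentencepiece are importable.")
```
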
ASR and Speech Translation

```python
from datasets import load_dataset
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

# Load the fine-tuned ASR checkpoint and its matching processor
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

# Load a small LibriSpeech sample for the demo
ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

# Extract log-mel filter-bank features from the raw waveform
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")

# Generate token ids autoregressively, then decode them to text
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(transcription)
# ['mister quilter is the apostle of the middle classes and we are glad to welcome his gospel']
```

Multilingual speech translation

For multilingual speech translation models, `eos_token_id` is used as the `decoder_start_token_id` and the target
language id is forced as the first generated token. To force the target language id as the first generated token,
pass the `forced_bos_token_id` parameter to the `generate()` method. The following example shows how to translate
English speech to French text using the `facebook/s2t-medium-mustc-multilingual-st` checkpoint.

```python
from datasets import load_dataset
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor

# Load the multilingual speech translation checkpoint and its matching processor
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")

# Load a small English speech sample for the demo
ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

# Extract log-mel filter-bank features from the raw waveform
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")

# Force the French language id as the first generated token
generated_ids = model.generate(
    inputs["input_features"],
    attention_mask=inputs["attention_mask"],
    forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
)

translation = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(translation)
# ["(Vidéo) Si M. Kilder est l'apossible des classes moyennes, et nous sommes heureux d'être accueillis dans son évangile."]
```

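
The two settings described for multilingual models, starting the decoder from `eos_token_id` and forcing the language id as the first generated token, can be pictured with a toy greedy decoder. This is a simplified sketch and not the library's generation code: the token ids, the `toy_next_token` stand-in for a model, and the `greedy_decode` helper are all hypothetical.

```python
from typing import Callable, List, Optional

EOS_ID = 2  # hypothetical id of the end-of-sequence token
FR_ID = 7   # hypothetical id of the French language token


def greedy_decode(
    next_token: Callable[[List[int]], int],
    decoder_start_token_id: int,
    eos_token_id: int,
    forced_bos_token_id: Optional[int] = None,
    max_new_tokens: int = 10,
) -> List[int]:
    """Generate tokens one at a time, optionally forcing the first one."""
    tokens = [decoder_start_token_id]
    for step in range(max_new_tokens):
        if step == 0 and forced_bos_token_id is not None:
            token = forced_bos_token_id  # the first generated token is not chosen by the model
        else:
            token = next_token(tokens)
        tokens.append(token)
        if token == eos_token_id:
            break
    return tokens


def toy_next_token(prefix: List[int]) -> int:
    """Stand-in for a model: emits 10, 11, 12 after the language token, then EOS."""
    script = [10, 11, 12, EOS_ID]
    generated = max(len(prefix) - 2, 0)  # tokens produced after [start, language]
    return script[min(generated, len(script) - 1)]


if __name__ == "__main__":
    # The decoder starts from EOS and French is forced as the first generated token.
    print(greedy_decode(toy_next_token, EOS_ID, EOS_ID, forced_bos_token_id=FR_ID))
```

In this toy setup the output sequence begins with the start token followed by the language token, and everything after that is left to the model.
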
See the model hub to look for Speech2Text checkpoints.

Speech2TextConfig

[[autodoc]] Speech2TextConfig

Speech2TextTokenizer

[[autodoc]] Speech2TextTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

Speech2TextFeatureExtractor

[[autodoc]] Speech2TextFeatureExtractor
    - __call__

Speech2TextProcessor

[[autodoc]] Speech2TextProcessor
    - __call__
    - from_pretrained
    - save_pretrained
    - batch_decode
    - decode

Speech2TextModel

[[autodoc]] Speech2TextModel
    - forward

Speech2TextForConditionalGeneration

[[autodoc]] Speech2TextForConditionalGeneration
    - forward

TFSpeech2TextModel

[[autodoc]] TFSpeech2TextModel
    - call

TFSpeech2TextForConditionalGeneration

[[autodoc]] TFSpeech2TextForConditionalGeneration
    - call