Speech2Text2
Overview
The Speech2Text2 model is used together with Wav2Vec2 for the speech translation models proposed in
Large-Scale Self- and Semi-Supervised Learning for Speech Translation by
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
Speech2Text2 is a decoder-only transformer model that can be combined with any encoder-only speech model, such as
Wav2Vec2 or HuBERT, for speech-to-text tasks. Please refer to the
SpeechEncoderDecoder class for how to combine Speech2Text2 with any encoder-only speech
model.
This model was contributed by Patrick von Platen.
The original code can be found here.
Usage tips
- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
  the official models.
- Speech2Text2 is always used within the SpeechEncoderDecoder framework.
- Speech2Text2's tokenizer is based on fastBPE.
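To make the fastBPE point concrete: fastBPE-style vocabularies mark a subword that continues into the next token with a trailing `@@`, so detokenizing amounts to joining the tokens with spaces and then deleting every `@@ ` sequence. The following is a minimal sketch of that convention in plain Python. The helper name `merge_bpe_tokens` and the sample tokens are illustrative assumptions, not part of the library API.

```python
# Continuation marker used by fastBPE-style vocabularies: a token ending in
# "@@" is glued to the token that follows it.
BPE_CONTINUATION = "@@ "


def merge_bpe_tokens(tokens):
    """Join subword tokens into a plain string (hypothetical helper)."""
    text = " ".join(tokens)
    # Removing "@@ " fuses each continued subword with its successor.
    return text.replace(BPE_CONTINUATION, "")


# Made-up tokens for illustration; real ones come from the tokenizer vocabulary.
tokens = ["Das", "ist", "ein", "Bei@@", "spiel", "."]
print(merge_bpe_tokens(tokens))
```

In practice the tokenizer's `decode` and `batch_decode` methods take care of this step, so the helper is only meant to show what happens to the subword pieces.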
Inference
Speech2Text2's [SpeechEncoderDecoderModel] model accepts raw waveform input values from speech and
makes use of [~generation.GenerationMixin.generate] to translate the input speech
autoregressively to the target language.
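As a rough illustration of what "autoregressively" means here, the toy loop below picks one token at a time, feeding everything generated so far back in until an end-of-sequence token appears. It is a sketch of plain greedy decoding only, not the library's `generate` implementation, which also supports other strategies such as beam search. The names `greedy_decode`, `next_token_scores`, and `toy_scores` are hypothetical.

```python
def greedy_decode(next_token_scores, bos_id, eos_id, max_new_tokens=10):
    """Toy greedy decoding: append the best-scoring token until EOS.

    `next_token_scores` is a hypothetical callable mapping the ids generated
    so far to a dict {token_id: score}. In the real model this role is played
    by the decoder attending to the encoded speech.
    """
    ids = [bos_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        best = max(scores, key=scores.get)
        ids.append(best)
        if best == eos_id:
            break
    return ids


def toy_scores(ids):
    # Toy scorer: always prefers "previous id + 1", so decoding counts upward.
    return {tok: (1.0 if tok == ids[-1] + 1 else 0.0) for tok in range(5)}


print(greedy_decode(toy_scores, bos_id=0, eos_id=3))  # [0, 1, 2, 3]
```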
The [Wav2Vec2FeatureExtractor] class is responsible for preprocessing the input speech and
[Speech2Text2Tokenizer] decodes the generated target tokens to the target string. The
[Speech2Text2Processor] wraps [Wav2Vec2FeatureExtractor] and
[Speech2Text2Tokenizer] into a single instance to both extract the input features and decode the
predicted token ids.
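One part of that preprocessing, when the feature extractor's `do_normalize` option is enabled, is rescaling each raw waveform to zero mean and unit variance. The sketch below reimplements that idea in plain Python for illustration. It assumes the option is turned on for the checkpoint in use, and the helper name `zero_mean_unit_var`, the `eps` value, and the sample data are assumptions rather than library code.

```python
import math


def zero_mean_unit_var(waveform, eps=1e-7):
    """Normalize one raw waveform to zero mean and (nearly) unit variance."""
    mean = sum(waveform) / len(waveform)
    # Population variance (divide by N); eps guards against division by zero.
    var = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    return [(x - mean) / math.sqrt(var + eps) for x in waveform]


samples = [0.0, 0.5, -0.5, 1.0, -1.0]  # made-up audio samples
normalized = zero_mean_unit_var(samples)
print(normalized)
```

When using the processor, this happens inside the call that produces `input_values`, so there is no need to normalize the audio by hand.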
Step-by-step Speech Translation
```python
import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
from datasets import load_dataset
import soundfile as sf

model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")


def map_to_array(batch):
    # Read the audio file into a raw waveform array
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch


ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# Extract input features from the first sample (audio sampled at 16 kHz)
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
# Autoregressively generate the target-language token ids
generated_ids = model.generate(inputs=inputs["input_values"], attention_mask=inputs["attention_mask"])
# Decode the generated token ids into text
transcription = processor.batch_decode(generated_ids)
```
Speech Translation via Pipelines
The automatic speech recognition pipeline can also be used to translate speech in just a couple of lines of code:
```python
from datasets import load_dataset
from transformers import pipeline

librispeech_en = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# Build the pipeline from the English-to-German speech translation checkpoint
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/s2t-wav2vec2-large-en-de",
    feature_extractor="facebook/s2t-wav2vec2-large-en-de",
)

# Translate the first audio file of the dataset
translation_de = asr(librispeech_en[0]["file"])
```
See the model hub for Speech2Text2 checkpoints.
Resources
Causal language modeling task guide
Speech2Text2Config
[[autodoc]] Speech2Text2Config
Speech2Text2Tokenizer
[[autodoc]] Speech2Text2Tokenizer
- batch_decode
- decode
- save_vocabulary
Speech2Text2Processor
[[autodoc]] Speech2Text2Processor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode
Speech2Text2ForCausalLM
[[autodoc]] Speech2Text2ForCausalLM
- forward