
	Speech2Text2
	Overview
	The Speech2Text2 model is used together with Wav2Vec2 for the Speech Translation models proposed in
	Large-Scale Self- and Semi-Supervised Learning for Speech Translation by
	Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, and Alexis Conneau.
	Speech2Text2 is a decoder-only transformer model that can be used with any encoder-only speech model, such as
	Wav2Vec2 or HuBERT, for Speech-to-Text tasks. Please refer to the
	SpeechEncoderDecoder class on how to combine Speech2Text2 with any encoder-only speech
	model.
	This model was contributed by Patrick von Platen.
	The original code can be found here.
	Usage tips

	- Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see
	  the official models.
	- Speech2Text2 is always used within the SpeechEncoderDecoder framework.
	- Speech2Text2's tokenizer is based on fastBPE.
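
To give a feel for what a fastBPE-style tokenizer does at decoding time, here is a minimal sketch. It assumes the common fastBPE convention in which a trailing `@@` marks a subword that continues into the next token. The function name `merge_bpe_tokens` and the sample tokens are hypothetical, and this is not the library's own implementation.

```python
def merge_bpe_tokens(tokens, continuation="@@"):
    """Join subword tokens into a string, gluing pieces marked with a trailing continuation marker."""
    text = " ".join(tokens)
    # "@@ " marks a word-internal split, so removing it merges the two pieces
    text = text.replace(continuation + " ", "")
    # A dangling marker on the very last token has nothing to attach to
    if text.endswith(continuation):
        text = text[: -len(continuation)]
    return text


# Hypothetical subword sequence; "Bei@@" continues into "spiel"
tokens = ["Das", "ist", "ein", "Bei@@", "spiel"]
print(merge_bpe_tokens(tokens))
```

In practice the tokenizer's `decode` and `batch_decode` methods handle this step, so a helper like this is only useful for understanding the output format.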

	Inference
	Speech2Text2's [SpeechEncoderDecoderModel] model accepts raw waveform input values from speech and
	makes use of [~generation.GenerationMixin.generate] to translate the input speech
	autoregressively to the target language.
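
Autoregressive generation means that each target token is predicted from the encoder output and the tokens generated so far. The loop below is a toy sketch of greedy decoding under that definition. `toy_next_token_scores`, the vocabulary, and the token ids are invented for illustration and stand in for a real decoder forward pass; the actual `generate` method supports other strategies (such as beam search) that are not shown here.

```python
BOS, EOS = 0, 1
VOCAB = {0: "<s>", 1: "</s>", 2: "Hallo", 3: "Welt"}  # hypothetical toy vocabulary


def toy_next_token_scores(encoder_state, prefix):
    """Hypothetical stand-in for a decoder forward pass: returns one score per vocabulary id."""
    # Toy rule: emit the ids stored in encoder_state one by one, then EOS
    step = len(prefix) - 1  # tokens generated so far (prefix starts with BOS)
    target = encoder_state[step] if step < len(encoder_state) else EOS
    return [1.0 if token_id == target else 0.0 for token_id in VOCAB]


def greedy_decode(encoder_state, max_new_tokens=10):
    """Repeatedly append the highest-scoring token until EOS or the length limit."""
    prefix = [BOS]
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(encoder_state, prefix)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        prefix.append(next_id)
        if next_id == EOS:
            break
    return prefix


ids = greedy_decode(encoder_state=[2, 3])
print(ids, " ".join(VOCAB[i] for i in ids[1:-1]))
```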
	The [Wav2Vec2FeatureExtractor] class is responsible for preprocessing the input speech and
	[Speech2Text2Tokenizer] decodes the generated target tokens to the target string. The
	[Speech2Text2Processor] wraps [Wav2Vec2FeatureExtractor] and
	[Speech2Text2Tokenizer] into a single instance to both extract the input features and decode the
	predicted token ids.
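
As a rough picture of the preprocessing step, the sketch below normalizes each raw waveform to zero mean and unit variance and pads a batch to a common length with an attention mask. It assumes this is the kind of processing the feature extractor applies when normalization is enabled; the helper names, the epsilon, and the padding value are assumptions made for illustration, and the real class should be used in practice.

```python
import math


def normalize(waveform, eps=1e-7):
    """Scale a waveform to zero mean and unit variance."""
    mean = sum(waveform) / len(waveform)
    var = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    return [(x - mean) / math.sqrt(var + eps) for x in waveform]


def pad_batch(waveforms, padding_value=0.0):
    """Pad waveforms to the longest length and mark real samples with 1 in the mask."""
    max_len = max(len(w) for w in waveforms)
    input_values, attention_mask = [], []
    for w in waveforms:
        pad = max_len - len(w)
        # Normalize only the real samples, then append padding
        input_values.append(normalize(w) + [padding_value] * pad)
        attention_mask.append([1] * len(w) + [0] * pad)
    return {"input_values": input_values, "attention_mask": attention_mask}


# Two hypothetical waveforms of different lengths
batch = pad_batch([[0.1, 0.3, -0.2, 0.4], [0.5, -0.5]])
print(batch["attention_mask"])
```

The returned keys mirror the `input_values` and `attention_mask` entries that are passed to `generate` in the step-by-step example.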

	Step-by-step Speech Translation

	```python
	import soundfile as sf
	import torch
	from datasets import load_dataset
	from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

	model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
	processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")


	def map_to_array(batch):
	    # Read the audio file into a raw waveform array
	    speech, _ = sf.read(batch["file"])
	    batch["speech"] = speech
	    return batch


	ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	ds = ds.map(map_to_array)

	# Extract model inputs from the first utterance (16 kHz audio)
	inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
	# Generate the translated token ids autoregressively
	generated_ids = model.generate(inputs=inputs["input_values"], attention_mask=inputs["attention_mask"])
	# Decode the generated ids into target-language text
	transcription = processor.batch_decode(generated_ids)
	```

	Speech Translation via Pipelines

	The automatic speech recognition pipeline can also be used to translate speech in just a couple of lines of code:

	```python
	from datasets import load_dataset
	from transformers import pipeline

	librispeech_en = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	asr = pipeline(
	    "automatic-speech-recognition",
	    model="facebook/s2t-wav2vec2-large-en-de",
	    feature_extractor="facebook/s2t-wav2vec2-large-en-de",
	)

	# Translate the first English utterance into German
	translation_de = asr(librispeech_en[0]["file"])
	```

	See the model hub to look for Speech2Text2 checkpoints.
	Resources

	Causal language modeling task guide

	Speech2Text2Config
	[[autodoc]] Speech2Text2Config
	Speech2Text2Tokenizer
	[[autodoc]] Speech2Text2Tokenizer
	- batch_decode
	- decode
	- save_vocabulary
	Speech2Text2Processor
	[[autodoc]] Speech2Text2Processor
	- __call__
	- from_pretrained
	- save_pretrained
	- batch_decode
	- decode
	Speech2Text2ForCausalLM
	[[autodoc]] Speech2Text2ForCausalLM
	- forward