---
license: cc-by-4.0
language:
- en
library_name: transformers
tags:
- audio
- automatic-speech-recognition
---

# Model Card for Kyutai STT

This repo is meant to use the model with [Transformers](https://github.com/huggingface/transformers) 🤗

Starting with `transformers >= 4.53.0`, you can run Kyutai STT natively!

```bash
pip install -U transformers
```

Inference:
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en-trfs"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```

Batched inference:

```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en-trfs"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)
```

See also the [project page](https://kyutai.org/next/stt)
and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).

This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR).
Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript,
our model starts to output the transcript as soon as a few seconds of audio become available.
## Model Details

The model architecture is a Transformer that consumes audio tokenized by Mimi (see [the Moshi paper](https://arxiv.org/abs/2410.00037)) and outputs text tokens.
The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
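
For intuition, here is a small back-of-the-envelope sketch (not from the Kyutai codebase; the helper name is made up) of how many Mimi frames and audio tokens a given audio duration corresponds to at this frame rate:

```python
# Illustration only: token budget implied by Mimi's 12.5 Hz frame rate
# with 32 audio tokens per frame.
FRAME_RATE_HZ = 12.5
TOKENS_PER_FRAME = 32

def mimi_token_budget(duration_s: float) -> tuple[int, int]:
    """Return (num_frames, num_audio_tokens) for `duration_s` seconds of audio."""
    num_frames = int(duration_s * FRAME_RATE_HZ)
    return num_frames, num_frames * TOKENS_PER_FRAME

print(mimi_token_budget(10.0))  # (125, 4000): 10 s of audio -> 125 frames -> 4000 audio tokens
```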

We release two models:
- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

## Model Description

Kyutai STT is a decoder-only model for streaming speech-to-text.
It leverages the multistream architecture of [Moshi](https://moshi.chat/) to model the text stream based on the speech stream.
The text stream is shifted with respect to the audio stream to allow the model to predict text tokens based on the input audio.
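
As a purely illustrative sketch of this shift (not the actual implementation; the function, padding symbol, and per-frame strings are assumptions), delaying the text stream by a fixed number of frames looks roughly like this:

```python
# Illustration only: shift a per-frame text stream right by a fixed delay so
# that the text for audio frame t is emitted at step t + delay. The real model
# works on token IDs; placeholder strings and an assumed pad symbol are used here.
FRAME_RATE_HZ = 12.5

def delay_text_stream(text_frames: list[str], delay_s: float, pad: str = "<pad>") -> list[str]:
    """Prepend `delay_s` seconds worth of padding frames to the text stream."""
    delay_frames = round(delay_s * FRAME_RATE_HZ)
    return [pad] * delay_frames + text_frames

# With the 2.6B model's 2.5 s delay, the text stream lags the audio by ~31 frames.
print(len(delay_text_stream(["hello", "world"], delay_s=2.5)))  # 33
```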

* Developed by: Kyutai
* Model type: Streaming Speech-to-Text transcription.
* Language(s) (NLP): English and French for `kyutai/stt-1b-en_fr`, English for `kyutai/stt-2.6b-en`
* License: Model weights are licensed under CC-BY 4.0
* Repository: [GitHub](https://github.com/kyutai-labs/delayed-streams-modeling/)
## Uses

### Direct Use

The model can be used for streaming speech-to-text.
It is robust to noisy conditions and was found to perform well on audio up to 2 hours long with no additional changes.
The model produces transcripts with capitalization and punctuation.
The predicted text token timestamps can be recovered by subtracting the model's text stream offset (0.5 or 2.5 seconds) from the frame's offset.
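
As a rough, assumption-laden sketch of that timestamp recovery (the helper below is not from the repository): with Mimi's 12.5 Hz frame rate, a text token emitted at frame index `i` maps to roughly `i / 12.5 - delay` seconds of audio, where `delay` is 0.5 s or 2.5 s depending on the model.

```python
# Illustration only: approximate audio timestamp for a text token emitted at a
# given frame index, obtained by subtracting the model's text-stream delay.
FRAME_RATE_HZ = 12.5

def token_timestamp(frame_index: int, delay_s: float) -> float:
    """`delay_s` is 0.5 for kyutai/stt-1b-en_fr and 2.5 for kyutai/stt-2.6b-en."""
    return max(0.0, frame_index / FRAME_RATE_HZ - delay_s)

print(token_timestamp(100, delay_s=2.5))  # 100 / 12.5 - 2.5 = 5.5 seconds
```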

## How to Get Started with the Model

See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).

## Training Details

### Training Data

Pretraining stage: For both `kyutai/stt-2.6b-en` and `kyutai/stt-1b-en_fr`, we use an audio collection of 2.5 million hours of publicly available audio content.
For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).

For `kyutai/stt-2.6b-en`:

- Finetuning stage: We then finetune the model on a collection of public datasets with ground-truth transcripts. This dataset contains 24,000 hours of audio.

- Long-form finetuning stage: Finally, we finetune the model on a combination of data from the previous stage and long-form audio.
The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1,000 hours) and (b) synthesizing dialogs (22,000 hours).

For `kyutai/stt-1b-en_fr`:

- Finetuning stage: We finetune on the Fisher dataset of 2,000 hours of English audio, plus proprietary data (1,000 hours in English, 600 hours in French).

### Compute Infrastructure

Pretraining and finetuning were done with 48 and 16 Nvidia H100 GPUs, respectively.

## Model Card Authors

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez