---
license: cc-by-4.0
language:
- en
library_name: transformers
tags:
- audio
- automatic-speech-recognition
---

# Model Card for Kyutai STT

This repo is meant to use the model with [Transformers](https://github.com/huggingface/transformers) 🤗

Starting with `transformers >= 4.53.0`, you can run Kyutai STT natively!

```bash
pip install -U transformers
```

Inference:
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en-trfs"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```

Batched inference:

```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en-trfs"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)
```

See also the [project page](https://kyutai.org/next/stt)
and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).

This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR).
Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript,
our model starts to output the transcript as soon as a few seconds of audio become available.
## Model Details

The model architecture is a Transformer that consumes audio tokenized by Mimi (see [the Moshi paper](https://arxiv.org/abs/2410.00037)) and outputs text tokens.
The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
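
For intuition, here is a small back-of-the-envelope sketch (not from the Kyutai codebase; the helper name is made up) of how many Mimi frames and audio tokens a given audio duration corresponds to at this frame rate:

```python
# Illustration only: token budget implied by Mimi's 12.5 Hz frame rate
# with 32 audio tokens per frame.
FRAME_RATE_HZ = 12.5
TOKENS_PER_FRAME = 32

def mimi_token_budget(duration_s: float) -> tuple[int, int]:
    """Return (num_frames, num_audio_tokens) for `duration_s` seconds of audio."""
    num_frames = int(duration_s * FRAME_RATE_HZ)
    return num_frames, num_frames * TOKENS_PER_FRAME

print(mimi_token_budget(10.0))  # (125, 4000): 10 s of audio -> 125 frames -> 4000 audio tokens
```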

We release two models:
- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

## Model Description

Kyutai STT is a decoder-only model for streaming speech-to-text.
It leverages the multistream architecture of [Moshi](https://moshi.chat/) to model the text stream based on the speech stream.
The text stream is shifted with respect to the audio stream to allow the model to predict text tokens based on the input audio.
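
As a purely illustrative sketch of this shift (not the actual implementation; the function, padding symbol, and per-frame strings are assumptions), delaying the text stream by a fixed number of frames looks roughly like this:

```python
# Illustration only: shift a per-frame text stream right by a fixed delay so
# that the text for audio frame t is emitted at step t + delay. The real model
# works on token IDs; placeholder strings and an assumed pad symbol are used here.
FRAME_RATE_HZ = 12.5

def delay_text_stream(text_frames: list[str], delay_s: float, pad: str = "<pad>") -> list[str]:
    """Prepend `delay_s` seconds worth of padding frames to the text stream."""
    delay_frames = round(delay_s * FRAME_RATE_HZ)
    return [pad] * delay_frames + text_frames

# With the 2.6B model's 2.5 s delay, the text stream lags the audio by ~31 frames.
print(len(delay_text_stream(["hello", "world"], delay_s=2.5)))  # 33
```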

* Developed by: Kyutai
* Model type: Streaming Speech-to-Text transcription.
* Language(s) (NLP): English and French for `kyutai/stt-1b-en_fr`, English for `kyutai/stt-2.6b-en`
* License: Model weights are licensed under CC-BY 4.0
* Repository: [GitHub](https://github.com/kyutai-labs/delayed-streams-modeling/)
## Uses

### Direct Use

The model can be used for streaming speech-to-text.
It is robust to noisy conditions and was found to perform well on audio up to 2 hours long with no additional changes.
The model produces transcripts with capitalization and punctuation.
The predicted text token timestamps can be recovered by subtracting the model's text stream offset (0.5 or 2.5 seconds) from the frame's offset.
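
As a rough, assumption-laden sketch of that timestamp recovery (the helper below is not from the repository): with Mimi's 12.5 Hz frame rate, a text token emitted at frame index `i` maps to roughly `i / 12.5 - delay` seconds of audio, where `delay` is 0.5 s or 2.5 s depending on the model.

```python
# Illustration only: approximate audio timestamp for a text token emitted at a
# given frame index, obtained by subtracting the model's text-stream delay.
FRAME_RATE_HZ = 12.5

def token_timestamp(frame_index: int, delay_s: float) -> float:
    """`delay_s` is 0.5 for kyutai/stt-1b-en_fr and 2.5 for kyutai/stt-2.6b-en."""
    return max(0.0, frame_index / FRAME_RATE_HZ - delay_s)

print(token_timestamp(100, delay_s=2.5))  # 100 / 12.5 - 2.5 = 5.5 seconds
```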

## How to Get Started with the Model

See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).

## Training Details

### Training Data

Pretraining stage: For both `kyutai/stt-2.6b-en` and `kyutai/stt-1b-en_fr`, we use an audio collection of 2.5 million hours of publicly available audio content.
For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).

For `kyutai/stt-2.6b-en`:

- Finetuning stage: We then finetune the model on a collection of public datasets with ground-truth transcripts. This dataset contains 24,000 hours of audio.

- Long-form finetuning stage: Finally, we finetune the model on a combination of data from the previous stage and long-form audio.
The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1,000 hours) and (b) synthesizing dialogs (22,000 hours).

For `kyutai/stt-1b-en_fr`:

- Finetuning stage: We finetune on the Fisher dataset of 2,000 hours of English audio, plus proprietary data (1,000 hours in English, 600 hours in French).

### Compute Infrastructure

Pretraining and finetuning were done with 48 and 16 Nvidia H100 GPUs, respectively.

## Model Card Authors

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez