---
license: cc-by-4.0
language:
  - en
library_name: transformers
tags:
  - audio
  - automatic-speech-recognition
---
# Model Card for Kyutai STT

This repo provides the model for use with [Transformers](https://github.com/huggingface/transformers) 🤗

Starting with `transformers >= 4.53.0`, you can run Kyutai STT natively!
```bash
pip install -U transformers
```

Inference:
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en-trfs"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(ds[0]["audio"]["array"])
inputs = inputs.to(torch_device)

# 4. run inference
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```

Batched inference:
```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en-trfs"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device, torch_dtype="auto")

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. run inference
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)
```

See also the [project page](https://kyutai.org/next/stt)
and the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).

This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR).
Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript,
our model starts to output the transcript as soon as a few seconds of audio become available.

## Model Details

The model architecture is a Transformer that consumes audio tokenized by Mimi (see [the Moshi paper](https://arxiv.org/abs/2410.00037)) and outputs text tokens.
The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens.
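
As a quick sanity check on these numbers, here is the token-count arithmetic (the variable names are just for illustration):

```python
# Audio-token arithmetic for the Mimi tokenizer used by Kyutai STT:
# 12.5 frames per second, 32 audio tokens per frame.
frame_rate_hz = 12.5
tokens_per_frame = 32

audio_seconds = 60.0                               # one minute of audio
num_frames = audio_seconds * frame_rate_hz         # 750 frames
num_audio_tokens = num_frames * tokens_per_frame   # 24,000 audio tokens
print(int(num_frames), int(num_audio_tokens))      # 750 24000
```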

We release two models:
- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

## Model Description

Kyutai STT is a decoder-only model for streaming speech-to-text.
It leverages the multistream architecture of [Moshi](https://moshi.chat/) to model the text stream based on the speech stream.
The text stream is shifted with respect to the audio stream, allowing the model to predict text tokens based on the audio it has already heard.
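
As a rough illustration of this shift (a simplified sketch, not the actual training code; the padding token and function names are hypothetical):

```python
# Schematic sketch of the delayed-streams shift (simplified, hypothetical names).
# The per-frame text stream is delayed by the model's text offset so that each
# text token is only predicted after the corresponding audio has been heard.
FRAME_RATE_HZ = 12.5  # Mimi frame rate

def shift_text_stream(text_frames, delay_seconds, pad_token="<pad>"):
    """Prepend `delay_seconds` worth of padding frames so text lags the audio."""
    delay_frames = round(delay_seconds * FRAME_RATE_HZ)
    return [pad_token] * delay_frames + text_frames

# Example: a 2.5 s delay corresponds to ~31 frames at 12.5 Hz.
shifted = shift_text_stream(["hello", "world"], delay_seconds=2.5)
print(len(shifted) - 2)  # 31 padding frames
```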

* Developed by: Kyutai
* Model type: Streaming Speech-to-Text transcription.
* Language(s) (NLP): English and French for `kyutai/stt-1b-en_fr`, English for `kyutai/stt-2.6b-en`
* License: Model weights are licensed under CC-BY 4.0
* Repository: [GitHub](https://github.com/kyutai-labs/delayed-streams-modeling/)

## Uses

### Direct Use

The model can be used for streaming speech-to-text.
It is robust to noisy conditions and was found to perform well on audio up to 2 hours long with no additional changes.
The model produces transcripts with capitalization and punctuation.
The predicted text token timestamps can be recovered by subtracting the model's text stream offset (0.5 or 2.5 seconds) from the frame's offset.
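
For example, a minimal sketch of recovering a token's timestamp from its frame index (the function name is just for illustration):

```python
FRAME_RATE_HZ = 12.5  # Mimi frame rate

def token_timestamp(frame_index, text_delay_seconds):
    """Approximate time (in seconds) in the audio at which a text token was spoken.

    frame_index: position of the predicted token in the output frame sequence.
    text_delay_seconds: 0.5 for kyutai/stt-1b-en_fr, 2.5 for kyutai/stt-2.6b-en.
    """
    frame_offset = frame_index / FRAME_RATE_HZ
    return frame_offset - text_delay_seconds

# A token emitted at frame 100 (8.0 s into the output stream) was spoken ~5.5 s into the audio.
print(token_timestamp(frame_index=100, text_delay_seconds=2.5))  # 5.5
```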

## How to Get Started with the Model

See the [GitHub repository](https://github.com/kyutai-labs/delayed-streams-modeling/).

## Training Details

### Training Data

Pretraining stage: For both `kyutai/stt-2.6b-en` and `kyutai/stt-1b-en_fr`, we use an audio collection of 2.5 million hours of publicly available audio content.
For this dataset, we obtained synthetic transcripts by running [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped).

For `kyutai/stt-2.6b-en`:

- Finetuning stage: We then finetune the model on a collection of public datasets with
ground-truth transcripts, totaling 24,000 hours of audio.

- Long-form finetuning stage: Finally, we finetune the model on a combination of data from the previous stage and long-form audio.
The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1,000 hours), (b) synthesizing dialogs (22,000 hours).

For `kyutai/stt-1b-en_fr`:

- Finetuning stage: We finetune on the Fisher dataset (2,000 hours of English audio), plus proprietary data (1,000 hours in English, 600 hours in French).

### Compute Infrastructure

Pretraining and finetuning were done on 48 and 16 Nvidia H100 GPUs, respectively.

## Model Card Authors

Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez