Commit 5fbb969 (verified) · 0 parent(s)

Super-squash branch 'main' using huggingface_hub


Co-authored-by: sanchit-gandhi <sanchit-gandhi@users.noreply.huggingface.co>
Co-authored-by: pandora-s <pandora-s@users.noreply.huggingface.co>

Files changed (5):
  1. .gitattributes +37 -0
  2. README.md +242 -0
  3. consolidated.safetensors +3 -0
  4. params.json +34 -0
  5. tekken.json +3 -0
.gitattributes ADDED
@@ -0,0 +1,37 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tekken[[:space:]](19).json filter=lfs diff=lfs merge=lfs -text
+ tekken.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,242 @@
+ ---
+ language:
+ - en
+ - fr
+ - de
+ - es
+ - it
+ - pt
+ - nl
+ - hi
+ license: apache-2.0
+ library_name: vllm
+ inference: false
+ base_model:
+ - mistralai/Mistral-Small-24B-Base-2501
+ extra_gated_description: >-
+   If you want to learn more about how we process your personal data, please read
+   our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
+ pipeline_tag: audio-text-to-text
+ ---
+
+ # Voxtral Small 1.0 (24B) - 2507
+
+ Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
+
+ Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral-2507).
+
+ ## Key Features
+
+ Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities:
+ - **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
+ - **Long-form context**: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
+ - **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
+ - **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
+ - **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
+ - **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1
+
+ ## Benchmark Results
+
+ ### Audio
+
+ Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)
+
+
+ ### Text
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/9i9jpA2OLwhYut_wet836.png)
+
+ ## Usage
+
+ The model can be used with the following frameworks:
+ - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
+
+ **Notes**:
+
+ - Use `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
+ - Multiple audios per message and multiple user turns with audio are supported
+ - Function calling is supported
+ - System prompts are not yet supported
+
+
+ ### vLLM (recommended)
+
+ We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).
+
+ #### Installation
+
+ Make sure to install vLLM from the `main` branch (nightly wheels):
+
+ ```sh
+ pip install -U vllm\[audio\] \
+   --pre \
+   --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ Doing so should automatically install [`mistral_common >= 1.8.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.0).
+
+ To check:
+ ```sh
+ python -c "import mistral_common; print(mistral_common.__version__)"
+ ```
+
+ #### Offline
+
+ You can test that your vLLM setup works as expected by cloning the vLLM repo:
+
+ ```sh
+ git clone https://github.com/vllm-project/vllm && cd vllm
+ ```
+
+ and then running:
+
+ ```sh
+ python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
+ ```
+
+ #### Serve
+
+ We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.
+
+ 1. Spin up a server:
+
+ ```sh
+ vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 2
+ ```
+
+ **Note:** Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
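+
+ Once the server is up (it listens on port 8000 by default), you can, for example, verify that it is reachable by listing the served models through the OpenAI-compatible API; the host below is a placeholder:
+
+ ```sh
+ curl http://<your-server-host>:8000/v1/models
+ ```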
+
+
+ 2. To query the server you can use a simple Python snippet. See the following examples.
+
+
+ #### Audio Instruct
+
+ Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.
+
+ Make sure that your client has `mistral-common` with audio installed:
+
+ ```sh
+ pip install --upgrade mistral_common\[audio\]
+ ```
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```py
+ from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ models = client.models.list()
+ model = models.data[0].id
+
+ obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
+ bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
+
+ def file_to_chunk(file: str) -> AudioChunk:
+     audio = Audio.from_file(file, strict=False)
+     return AudioChunk.from_audio(audio)
+
+ text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
+ user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
+
+ print(30 * "=" + "USER 1" + 30 * "=")
+ print(text_chunk.text)
+ print("\n\n")
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=[user_msg],
+     temperature=0.2,
+     top_p=0.95,
+ )
+ content = response.choices[0].message.content
+
+ print(30 * "=" + "BOT 1" + 30 * "=")
+ print(content)
+ print("\n\n")
+ # The model could give the following answer:
+ # ```L'orateur le plus inspirant est le président.
+ # Il est plus inspirant parce qu'il parle de ses expériences personnelles
+ # et de son optimisme pour l'avenir du pays.
+ # Il est différent de l'autre orateur car il ne parle pas de la météo,
+ # mais plutôt de ses interactions avec les gens et de son rôle en tant que président.```
+
+ messages = [
+     user_msg,
+     AssistantMessage(content=content).to_openai(),
+     UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
+ ]
+ print(30 * "=" + "USER 2" + 30 * "=")
+ print(messages[-1]["content"])
+ print("\n\n")
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=messages,
+     temperature=0.2,
+     top_p=0.95,
+ )
+ content = response.choices[0].message.content
+ print(30 * "=" + "BOT 2" + 30 * "=")
+ print(content)
+ ```
+ </details>
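+
+ #### Function Calling
+
+ As noted above, function calling is supported: spoken intents can trigger tool calls. Below is a minimal sketch against the same server; the `get_weather` tool is a hypothetical placeholder, and it assumes the server honors the standard OpenAI `tools` parameter.
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```py
+ from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
+ model = client.models.list().data[0].id
+
+ # A spoken request, e.g. asking about the weather.
+ audio_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
+ audio_chunk = AudioChunk.from_audio(Audio.from_file(audio_file, strict=False))
+ user_msg = UserMessage(content=[audio_chunk]).to_openai()
+
+ # Hypothetical tool definition in the standard OpenAI format.
+ tools = [{
+     "type": "function",
+     "function": {
+         "name": "get_weather",
+         "description": "Get the current weather in a given city.",
+         "parameters": {
+             "type": "object",
+             "properties": {"city": {"type": "string", "description": "Name of the city."}},
+             "required": ["city"],
+         },
+     },
+ }]
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=[user_msg],
+     tools=tools,
+     temperature=0.2,
+     top_p=0.95,
+ )
+ # If the model decides to use the tool, the call appears here.
+ print(response.choices[0].message.tool_calls)
+ ```
+ </details>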
+
+ #### Transcription
+
+ Voxtral-Small-24B-2507 has powerful transcription capabilities!
+
+ Make sure that your client has `mistral-common` with audio installed:
+
+ ```sh
+ pip install --upgrade mistral_common\[audio\]
+ ```
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```python
+ from mistral_common.protocol.transcription.request import TranscriptionRequest
+ from mistral_common.protocol.instruct.messages import RawAudio
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ models = client.models.list()
+ model = models.data[0].id
+
+ obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
+ audio = Audio.from_file(obama_file, strict=False)
+
+ audio = RawAudio.from_audio(audio)
+ req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
+
+ response = client.audio.transcriptions.create(**req)
+ print(response)
+ ```
+ </details>
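+
+ Since Voxtral predicts the source audio language automatically (see Key Features), you can presumably omit the `language` argument and let the model detect it; this assumes `language` is an optional field of `TranscriptionRequest`:
+
+ ```python
+ req = TranscriptionRequest(model=model, audio=audio, temperature=0.0).to_openai(exclude=("top_p", "seed"))
+ ```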
consolidated.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:421cd5988a614ac9ecf63eaf0db3572a36a6318dbb3c57cba24d5ef4d4abea7e
+ size 48519877672
params.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "dim": 5120,
+   "n_layers": 40,
+   "head_dim": 128,
+   "hidden_dim": 32768,
+   "n_heads": 32,
+   "n_kv_heads": 8,
+   "rope_theta": 100000000.0,
+   "norm_eps": 1e-05,
+   "vocab_size": 131072,
+   "max_position_embeddings": 32768,
+   "multimodal": {
+     "whisper_model_args": {
+       "encoder_args": {
+         "dim": 1280,
+         "n_layers": 32,
+         "head_dim": 64,
+         "hidden_dim": 5120,
+         "n_heads": 20,
+         "vocab_size": 51866,
+         "max_source_positions": 1500,
+         "audio_encoding_args": {
+           "sampling_rate": 16000,
+           "num_mel_bins": 128,
+           "hop_length": 160,
+           "window_size": 400
+         }
+       },
+       "downsample_args": {
+         "downsample_factor": 4
+       }
+     }
+   }
+ }
tekken.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4aaf3836c2a5332f029ce85a7a62255c966f47b6797ef81dedd0ade9c862e4a8
+ size 14894206