README.md CHANGED
@@ -1,5 +1,4 @@
1
  ---
2
- library_name: vllm
3
  language:
4
  - en
5
  - fr
@@ -10,59 +9,56 @@ language:
10
  - nl
11
  - hi
12
  license: apache-2.0
 
13
  inference: false
14
  base_model:
15
  - mistralai/Mistral-Small-24B-Base-2501
16
  extra_gated_description: >-
17
  If you want to learn more about how we process your personal data, please read
18
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
19
- tags:
20
- - vllm
21
  pipeline_tag: audio-text-to-text
22
  ---
23
 
24
- # Voxtral Small 1.0 (24B) - 2507
 
 
25
 
26
- Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
 
 
27
 
28
- Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral) and our [research paper](https://arxiv.org/abs/2507.13264).
29
 
30
  ## Key Features
31
 
32
  Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.
33
- - **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
34
- - **Long-form context**: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
35
- - **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
36
- - **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
37
- - **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
38
- - **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1
39
 
40
  ## Benchmark Results
41
 
42
  ### Audio
43
 
44
- Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
45
 
46
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)
47
 
 
48
 
49
- ### Text
50
 
51
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/uDg3hKDwJowsNuj-yyt2T.png)
52
 
53
  ## Usage
54
 
55
The model can be used with the following frameworks:
56
  - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
57
- - [`Transformers` 🤗](https://github.com/huggingface/transformers): See [here](#transformers-🤗)
58
-
59
- **Notes**:
60
 
61
- - `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
62
- - Multiple audios per message and multiple user turns with audio are supported
63
- - Function calling is supported
64
- - System prompts are not yet supported
65
 
 
66
 
67
  ### vLLM (recommended)
68
 
@@ -70,32 +66,20 @@ We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).
70
 
71
  #### Installation
72
 
73
- Make sure to install vllm >= `0.10.0`, we recommend using uv
74
 
75
  ```
76
- uv pip install -U "vllm[audio]" --system
77
  ```
78
 
79
- Doing so should automatically install [`mistral_common >= 1.8.1`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.1).
80
 
81
  To check:
82
  ```
83
  python -c "import mistral_common; print(mistral_common.__version__)"
84
  ```
85
 
86
- #### Offline
87
-
88
- You can test that your vLLM setup works as expected by cloning the vLLM repo:
89
-
90
- ```sh
91
- git clone https://github.com/vllm-project/vllm && cd vllm
92
- ```
93
-
94
- and then running:
95
-
96
- ```sh
97
- python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
98
- ```
99
 
100
  #### Serve
101
 
@@ -104,7 +88,7 @@ We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.
104
  1. Spin up a server:
105
 
106
  ```
107
- vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 2 --tool-call-parser mistral --enable-auto-tool-choice
108
  ```
109
 
110
  **Note:** Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
@@ -117,16 +101,60 @@ vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_fo
117
 
118
  Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.
119
 
120
- Make sure that your client has `mistral-common` with audio installed:
 
121
 
122
- ```sh
123
- pip install --upgrade mistral_common\[audio\]
124
  ```
 
 
 
 
125
 
126
  <details>
127
  <summary>Python snippet</summary>
128
 
129
  ```py
 
 
 
 
130
  from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
131
  from mistral_common.audio import Audio
132
  from huggingface_hub import hf_hub_download
@@ -135,7 +163,7 @@ from openai import OpenAI
135
 
136
  # Modify OpenAI's API key and API base to use vLLM's API server.
137
  openai_api_key = "EMPTY"
138
- openai_api_base = "http://<your-server-host>:8000/v1"
139
 
140
  client = OpenAI(
141
  api_key=openai_api_key,
@@ -152,7 +180,7 @@ def file_to_chunk(file: str) -> AudioChunk:
152
  audio = Audio.from_file(file, strict=False)
153
  return AudioChunk.from_audio(audio)
154
 
155
- text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
156
  user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
157
 
158
  print(30 * "=" + "USER 1" + 30 * "=")
@@ -162,20 +190,16 @@ print("\n\n")
162
  response = client.chat.completions.create(
163
  model=model,
164
  messages=[user_msg],
165
- temperature=0.2,
166
- top_p=0.95,
167
  )
168
  content = response.choices[0].message.content
169
 
170
  print(30 * "=" + "BOT 1" + 30 * "=")
171
  print(content)
172
  print("\n\n")
173
- # The model could give the following answer:
174
- # ```L'orateur le plus inspirant est le président.
175
- # Il est plus inspirant parce qu'il parle de ses expériences personnelles
176
- # et de son optimisme pour l'avenir du pays.
177
- # Il est différent de l'autre orateur car il ne parle pas de la météo,
178
- # mais plutôt de ses interactions avec les gens et de son rôle en tant que président.```
179
 
180
  messages = [
181
  user_msg,
@@ -190,28 +214,16 @@ response = client.chat.completions.create(
190
  model=model,
191
  messages=messages,
192
  temperature=0.2,
193
- top_p=0.95,
194
  )
195
  content = response.choices[0].message.content
196
  print(30 * "=" + "BOT 2" + 30 * "=")
197
  print(content)
198
  ```
199
- </details>
200
 
201
- #### Transcription
202
-
203
- Voxtral-Small-24B-2507 has powerful transcription capabilities!
204
-
205
- Make sure that your client has `mistral-common` with audio installed:
206
-
207
- ```sh
208
- pip install --upgrade mistral_common\[audio\]
209
- ```
210
 
211
- <details>
212
- <summary>Python snippet</summary>
213
-
214
- ```python
215
  from mistral_common.protocol.transcription.request import TranscriptionRequest
216
  from mistral_common.protocol.instruct.messages import RawAudio
217
  from mistral_common.audio import Audio
@@ -221,7 +233,7 @@ from openai import OpenAI
221
 
222
  # Modify OpenAI's API key and API base to use vLLM's API server.
223
  openai_api_key = "EMPTY"
224
- openai_api_base = "http://<your-server-host>:8000/v1"
225
 
226
  client = OpenAI(
227
  api_key=openai_api_key,
@@ -235,385 +247,10 @@ obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo
235
  audio = Audio.from_file(obama_file, strict=False)
236
 
237
  audio = RawAudio.from_audio(audio)
238
- req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
239
 
240
  response = client.audio.transcriptions.create(**req)
241
  print(response)
242
  ```
243
- </details>
244
-
245
- #### Function Calling
246
-
247
- Voxtral has experimental function calling support. You can try it as shown below.
248
-
249
- Make sure that your client has `mistral-common` with audio installed:
250
-
251
- ```sh
252
- pip install --upgrade mistral_common\[audio\]
253
- ```
254
-
255
- <details>
256
- <summary>Python snippet</summary>
257
-
258
- ```python
259
- from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage, TextChunk
260
- from mistral_common.protocol.transcription.request import TranscriptionRequest
261
- from mistral_common.protocol.instruct.tool_calls import Function, Tool
262
-
263
- from mistral_common.audio import Audio
264
- from huggingface_hub import hf_hub_download
265
-
266
- from openai import OpenAI
267
-
268
- # Modify OpenAI's API key and API base to use vLLM's API server.
269
- openai_api_key = "EMPTY"
270
- openai_api_base = "http://<your-server-host>:8000/v1"
271
-
272
- client = OpenAI(
273
- api_key=openai_api_key,
274
- base_url=openai_api_base,
275
- )
276
-
277
- models = client.models.list()
278
- model = models.data[0].id
279
-
280
- tool = Tool(
281
- function=Function(
282
- name="get_current_weather",
283
- description="Get the current weather",
284
- parameters={
285
- "type": "object",
286
- "properties": {
287
- "location": {
288
- "type": "string",
289
- "description": "The city and state, e.g. San Francisco, CA",
290
- },
291
- "format": {
292
- "type": "string",
293
- "enum": ["celsius", "fahrenheit"],
294
- "description": "The temperature unit to use. Infer this from the user's location.",
295
- },
296
- },
297
- "required": ["location", "format"],
298
- },
299
- )
300
- )
301
- tools = [tool.to_openai()]
302
-
303
-
304
- weather_like = hf_hub_download("patrickvonplaten/audio_samples", "fn_calling.wav", repo_type="dataset")
305
-
306
- def file_to_chunk(file: str) -> AudioChunk:
307
- audio = Audio.from_file(file, strict=False)
308
- return AudioChunk.from_audio(audio)
309
-
310
- audio_chunk = file_to_chunk(weather_like)
311
-
312
- print(30 * "=" + "Transcription" + 30 * "=")
313
- req = TranscriptionRequest(model=model, audio=audio_chunk.input_audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
314
- response = client.audio.transcriptions.create(**req)
315
- print(response.text) # How is the weather in Madrid at the moment?
316
- print("\n")
317
-
318
-
319
- print(30 * "=" + "Function calling" + 30 * "=")
320
- audio_chunk = file_to_chunk(weather_like)
321
- user_msg = UserMessage(content=[audio_chunk]).to_openai()
322
- response = client.chat.completions.create(
323
- model=model,
324
- messages=[user_msg],
325
- temperature=0.2,
326
- top_p=0.95,
327
- tools=[tool.to_openai()]
328
- )
329
- print(30 * "=" + "BOT 1" + 30 * "=")
330
- print(response.choices[0].message.tool_calls)
331
- print("\n\n")
332
- ```
333
- </details>
334
-
335
- ### Transformers 🤗
336
 
337
- Starting with `transformers >= 4.54.0`, you can run Voxtral natively!
338
 
339
- Install Transformers:
340
- ```bash
341
- pip install -U transformers
342
- ```
343
-
344
- Make sure to have `mistral-common >= 1.8.1` installed with audio dependencies:
345
- ```bash
346
- pip install --upgrade "mistral-common[audio]"
347
- ```
348
-
349
- #### Audio Instruct
350
-
351
- <details>
352
- <summary>➡️ multi-audio + text instruction</summary>
353
-
354
- ```python
355
- from transformers import VoxtralForConditionalGeneration, AutoProcessor
356
- import torch
357
-
358
- device = "cuda"
359
- repo_id = "mistralai/Voxtral-Small-24B-2507"
360
-
361
- processor = AutoProcessor.from_pretrained(repo_id)
362
- model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
363
-
364
- conversation = [
365
- {
366
- "role": "user",
367
- "content": [
368
- {
369
- "type": "audio",
370
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
371
- },
372
- {
373
- "type": "audio",
374
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
375
- },
376
- {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
377
- ],
378
- }
379
- ]
380
-
381
- inputs = processor.apply_chat_template(conversation)
382
- inputs = inputs.to(device, dtype=torch.bfloat16)
383
-
384
- outputs = model.generate(**inputs, max_new_tokens=500)
385
- decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
386
-
387
- print("\nGenerated response:")
388
- print("=" * 80)
389
- print(decoded_outputs[0])
390
- print("=" * 80)
391
- ```
392
- </details>
393
-
394
-
395
- <details>
396
- <summary>➡️ multi-turn</summary>
397
-
398
- ```python
399
- from transformers import VoxtralForConditionalGeneration, AutoProcessor
400
- import torch
401
-
402
- device = "cuda"
403
- repo_id = "mistralai/Voxtral-Small-24B-2507"
404
-
405
- processor = AutoProcessor.from_pretrained(repo_id)
406
- model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
407
-
408
- conversation = [
409
- {
410
- "role": "user",
411
- "content": [
412
- {
413
- "type": "audio",
414
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
415
- },
416
- {
417
- "type": "audio",
418
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
419
- },
420
- {"type": "text", "text": "Describe briefly what you can hear."},
421
- ],
422
- },
423
- {
424
- "role": "assistant",
425
- "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
426
- },
427
- {
428
- "role": "user",
429
- "content": [
430
- {
431
- "type": "audio",
432
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
433
- },
434
- {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
435
- ],
436
- },
437
- ]
438
-
439
- inputs = processor.apply_chat_template(conversation)
440
- inputs = inputs.to(device, dtype=torch.bfloat16)
441
-
442
- outputs = model.generate(**inputs, max_new_tokens=500)
443
- decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
444
-
445
- print("\nGenerated response:")
446
- print("=" * 80)
447
- print(decoded_outputs[0])
448
- print("=" * 80)
449
- ```
450
- </details>
451
-
452
-
453
- <details>
454
- <summary>➡️ text only</summary>
455
-
456
- ```python
457
- from transformers import VoxtralForConditionalGeneration, AutoProcessor
458
- import torch
459
-
460
- device = "cuda"
461
- repo_id = "mistralai/Voxtral-Small-24B-2507"
462
-
463
- processor = AutoProcessor.from_pretrained(repo_id)
464
- model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
465
-
466
- conversation = [
467
- {
468
- "role": "user",
469
- "content": [
470
- {
471
- "type": "text",
472
- "text": "Why should AI models be open-sourced?",
473
- },
474
- ],
475
- }
476
- ]
477
-
478
- inputs = processor.apply_chat_template(conversation)
479
- inputs = inputs.to(device, dtype=torch.bfloat16)
480
-
481
- outputs = model.generate(**inputs, max_new_tokens=500)
482
- decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
483
-
484
- print("\nGenerated response:")
485
- print("=" * 80)
486
- print(decoded_outputs[0])
487
- print("=" * 80)
488
- ```
489
- </details>
490
-
491
-
492
- <details>
493
- <summary>➡️ audio only</summary>
494
-
495
- ```python
496
- from transformers import VoxtralForConditionalGeneration, AutoProcessor
497
- import torch
498
-
499
- device = "cuda"
500
- repo_id = "mistralai/Voxtral-Small-24B-2507"
501
-
502
- processor = AutoProcessor.from_pretrained(repo_id)
503
- model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
504
-
505
- conversation = [
506
- {
507
- "role": "user",
508
- "content": [
509
- {
510
- "type": "audio",
511
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
512
- },
513
- ],
514
- }
515
- ]
516
-
517
- inputs = processor.apply_chat_template(conversation)
518
- inputs = inputs.to(device, dtype=torch.bfloat16)
519
-
520
- outputs = model.generate(**inputs, max_new_tokens=500)
521
- decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
522
-
523
- print("\nGenerated response:")
524
- print("=" * 80)
525
- print(decoded_outputs[0])
526
- print("=" * 80)
527
- ```
528
- </details>
529
-
530
-
531
- <details>
532
- <summary>➡️ batched inference</summary>
533
-
534
- ```python
535
- from transformers import VoxtralForConditionalGeneration, AutoProcessor
536
- import torch
537
-
538
- device = "cuda"
539
- repo_id = "mistralai/Voxtral-Small-24B-2507"
540
-
541
- processor = AutoProcessor.from_pretrained(repo_id)
542
- model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
543
-
544
- conversations = [
545
- [
546
- {
547
- "role": "user",
548
- "content": [
549
- {
550
- "type": "audio",
551
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
552
- },
553
- {
554
- "type": "audio",
555
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
556
- },
557
- {
558
- "type": "text",
559
- "text": "Who's speaking in the speech and what city's weather is being discussed?",
560
- },
561
- ],
562
- }
563
- ],
564
- [
565
- {
566
- "role": "user",
567
- "content": [
568
- {
569
- "type": "audio",
570
- "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
571
- },
572
- {"type": "text", "text": "What can you tell me about this audio?"},
573
- ],
574
- }
575
- ],
576
- ]
577
-
578
- inputs = processor.apply_chat_template(conversations)
579
- inputs = inputs.to(device, dtype=torch.bfloat16)
580
-
581
- outputs = model.generate(**inputs, max_new_tokens=500)
582
- decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
583
-
584
- print("\nGenerated responses:")
585
- print("=" * 80)
586
- for decoded_output in decoded_outputs:
587
- print(decoded_output)
588
- print("=" * 80)
589
- ```
590
- </details>
591
-
592
- #### Transcription
593
-
594
- <details>
595
- <summary>➡️ transcribe</summary>
596
-
597
- ```python
598
- from transformers import VoxtralForConditionalGeneration, AutoProcessor
599
- import torch
600
-
601
- device = "cuda"
602
- repo_id = "mistralai/Voxtral-Small-24B-2507"
603
-
604
- processor = AutoProcessor.from_pretrained(repo_id)
605
- model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
606
-
607
- inputs = processor.apply_transcription_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
608
- inputs = inputs.to(device, dtype=torch.bfloat16)
609
-
610
- outputs = model.generate(**inputs, max_new_tokens=500)
611
- decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
612
-
613
- print("\nGenerated responses:")
614
- print("=" * 80)
615
- for decoded_output in decoded_outputs:
616
- print(decoded_output)
617
- print("=" * 80)
618
- ```
619
- </details>
 
1
  ---
 
2
  language:
3
  - en
4
  - fr
 
9
  - nl
10
  - hi
11
  license: apache-2.0
12
+ library_name: vllm
13
  inference: false
14
  base_model:
15
  - mistralai/Mistral-Small-24B-Base-2501
16
  extra_gated_description: >-
17
  If you want to learn more about how we process your personal data, please read
18
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
 
 
19
  pipeline_tag: audio-text-to-text
20
  ---
21
 
22
+ # Voxtral Small 24B-2507
23
+
24
+ Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription and understanding.
25
 
26
+ Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral-2507).
27
+
28
+ Both Voxtral models go beyond transcription with capabilities that include:
29
 
 
30
 
31
  ## Key Features
32
 
33
  Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.
34
+ - **Long-form context**: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
35
+ - **Built-in Q&A and summarization**: Supports asking questions directly about the audio content or generating structured summaries, without the need to chain separate ASR and language models
36
+ - **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, to name a few), helping teams serve global audiences with a single system
37
+ - **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents, turning voice interactions into actionable system commands without intermediate parsing steps.
38
+ - **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3
 
39
 
40
  ## Benchmark Results
41
 
42
  ### Audio
43
 
44
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/FKBy7KZAoTR52A_ht-ZpL.png)
45
 
46
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/-DcHsAjwuY6kbduC5XBH2.png)
47
 
48
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/7tJgqCPczuJZ4cj2PZe59.png)
49
 
50
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)
51
 
52
+ ### Text
53
 
54
  ## Usage
55
 
56
The model can be used with the following frameworks:
57
  - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
 
 
 
58
 
59
+ **Note 1**: We recommend using a relatively low temperature, such as `temperature=0.15`.
 
 
 
60
 
61
+ **Note 2**: Make sure to add a system prompt to the model to best tailor it to your needs.
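
With the OpenAI-compatible client used in the snippets below, a system prompt is simply the first message of the conversation. A minimal sketch of what that could look like; the prompt text and server address here are placeholders, not an official recommendation:

```py
from openai import OpenAI

# Point the client at the vLLM server started in the next section.
client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    temperature=0.15,  # see Note 1 above
    messages=[
        # Hypothetical system prompt: adapt it to your use case.
        {"role": "system", "content": "You are a helpful assistant. Answer concisely and in English."},
        {"role": "user", "content": "Summarize what Voxtral Small can do in two sentences."},
    ],
)
print(response.choices[0].message.content)
```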
62
 
63
  ### vLLM (recommended)
64
 
 
66
 
67
  #### Installation
68
 
69
+ Make sure to install [`vLLM >= 0.#.#`](https://github.com/vllm-project/vllm/releases/tag/v0.#.#):
70
 
71
  ```
72
+ pip install vllm --upgrade
73
  ```
74
 
75
+ Doing so should automatically install [`mistral_common >= 1.#.#`](https://github.com/mistralai/mistral-common/releases/tag/v1.#.#).
76
 
77
  To check:
78
  ```
79
  python -c "import mistral_common; print(mistral_common.__version__)"
80
  ```
81
 
82
+ You can also use the ready-to-go [Dockerfile](https://github.com/vllm-project/vllm/blob/main/Dockerfile) or the pre-built image on [Docker Hub](https://hub.docker.com/layers/vllm/vllm-openai/latest/images/sha256-de9032a92ffea7b5c007dad80b38fd44aac11eddc31c435f8e52f3b7404bbf39).
 
 
83
 
84
  #### Serve
85
 
 
88
  1. Spin up a server:
89
 
90
  ```
91
+ vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --tensor-parallel-size 2
92
  ```
93
 
94
  **Note:** Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
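
Once the server is up, a quick way to confirm it is reachable is to list the served models with the OpenAI-compatible client. A minimal sketch, assuming the default port 8000 on localhost:

```py
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# Should print the served model id, e.g. mistralai/Voxtral-Small-24B-2507.
print([m.id for m in client.models.list().data])
```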
 
101
 
102
  Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.
103
 
104
+ <details>
105
+ <summary>Python snippet</summary>
106
 
107
+ ```py
108
+ TODO
109
  ```
110
+ </details>
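
The snippet above is still a placeholder; in the meantime, here is a minimal chat sketch adapted from the client example further down in this card. The `bcn_weather.mp3` sample name and the sampling settings are assumptions, so adjust them to your setup:

```py
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI

# Point the OpenAI client at the vLLM server started above.
client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
model = client.models.list().data[0].id

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")  # assumed sample name

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.15,  # see Note 1 above
)
print(response.choices[0].message.content)
```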
111
+
112
+ #### Transcription
113
+
114
+ Voxtral-Small-24B-2507 has powerful transcription capabilities!
115
+
116
+ <details>
117
+ <summary>Python snippet</summary>
118
+
119
+ ```python
120
+ TODO
121
+ ```
122
+ </details>
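
As above, the snippet is still a placeholder; here is a minimal transcription sketch against the same server, adapted from the transcription client example further down in this card:

```py
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
model = client.models.list().data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = RawAudio.from_audio(Audio.from_file(obama_file, strict=False))

# temperature=0.0 is what the transcription examples in this card use.
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response)
```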
123
+
124
+ #### Function calling
125
+
126
+ Voxtral-Small-24B-2507 is excellent at function/tool calling tasks via vLLM. For example:
127
 
128
  <details>
129
  <summary>Python snippet</summary>
130
 
131
  ```py
132
+ ```
133
+
134
+ </details>
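
The block above was left empty; for reference, here is a function-calling sketch adapted from the example shipped in the earlier version of this card (the weather tool and the `fn_calling.wav` sample come from that example; the sampling settings are assumptions). It requires the server to be started with `--tool-call-parser mistral --enable-auto-tool-choice`, as shown above:

```py
from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
model = client.models.list().data[0].id

# A simple weather tool the model can trigger from a spoken request.
tool = Tool(
    function=Function(
        name="get_current_weather",
        description="Get the current weather",
        parameters={
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The temperature unit to use."},
            },
            "required": ["location", "format"],
        },
    )
)

# Spoken request along the lines of "How is the weather in Madrid at the moment?"
weather_file = hf_hub_download("patrickvonplaten/audio_samples", "fn_calling.wav", repo_type="dataset")
audio_chunk = AudioChunk.from_audio(Audio.from_file(weather_file, strict=False))
user_msg = UserMessage(content=[audio_chunk]).to_openai()

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.15,
    tools=[tool.to_openai()],
)
print(response.choices[0].message.tool_calls)
```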
135
+
136
+ # ORIGINAL
137
+
138
+ ```
139
+ VLLM_USE_PRECOMPILED=1 pip install --editable .\[audio\]
140
+ ```
141
+
142
+ of: https://github.com/vllm-project/vllm/pull/20970#pullrequestreview-3019578541
143
+
144
+ # Examples
145
+
146
+ ## Client/Server
147
+
148
+ ### Server
149
+
150
+ ```sh
151
+ vllm serve mistralai/voxtral-small --tokenizer_mode mistral --config_format mistral --load_format mistral --max_model_len 32768
152
+ ```
153
+
154
+ ### Client - Chat
155
+
156
+ ```py
157
+ #!/usr/bin/env python3
158
  from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
159
  from mistral_common.audio import Audio
160
  from huggingface_hub import hf_hub_download
 
163
 
164
  # Modify OpenAI's API key and API base to use vLLM's API server.
165
  openai_api_key = "EMPTY"
166
+ openai_api_base = "http://slurm-h100-reserved-rno-199-087:8000/v1"
167
 
168
  client = OpenAI(
169
  api_key=openai_api_key,
 
180
  audio = Audio.from_file(file, strict=False)
181
  return AudioChunk.from_audio(audio)
182
 
183
+ text_chunk = TextChunk(text="Which speaker do you prefer between the two? Why? How are they different from each other?")
184
  user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
185
 
186
  print(30 * "=" + "USER 1" + 30 * "=")
 
190
  response = client.chat.completions.create(
191
  model=model,
192
  messages=[user_msg],
193
+ temperature=0.0,
194
+ max_tokens=32768,
195
  )
196
  content = response.choices[0].message.content
197
 
198
  print(30 * "=" + "BOT 1" + 30 * "=")
199
  print(content)
200
  print("\n\n")
201
+ # "The speaker who delivers the farewell address is more engaging and inspiring. They express gratitude and optimism, emphasizing the importance of self-government and citizenship. They also share personal experiences and observations, making the speech more relatable and heartfelt. In contrast, the second speaker provides factual information about the weather in Barcelona, which is less engaging and lacks the emotional depth of the first speaker's address."
202
+ #
 
 
 
 
203
 
204
  messages = [
205
  user_msg,
 
214
  model=model,
215
  messages=messages,
216
  temperature=0.2,
217
+ max_tokens=32768,
218
  )
219
  content = response.choices[0].message.content
220
  print(30 * "=" + "BOT 2" + 30 * "=")
221
  print(content)
222
  ```
 
223
 
224
+ ### Client - Transcribe
 
 
225
 
226
+ ```py
 
 
 
227
  from mistral_common.protocol.transcription.request import TranscriptionRequest
228
  from mistral_common.protocol.instruct.messages import RawAudio
229
  from mistral_common.audio import Audio
 
233
 
234
  # Modify OpenAI's API key and API base to use vLLM's API server.
235
  openai_api_key = "EMPTY"
236
+ openai_api_base = "http://slurm-h100-reserved-rno-199-087:8000/v1"
237
 
238
  client = OpenAI(
239
  api_key=openai_api_key,
 
247
  audio = Audio.from_file(obama_file, strict=False)
248
 
249
  audio = RawAudio.from_audio(audio)
250
+ req = TranscriptionRequest(model=model, audio=audio, language="en").to_openai(exclude=("top_p", "seed"))
251
 
252
  response = client.audio.transcriptions.create(**req)
253
  print(response)
254
  ```
 
 
 
 
255
 
 
256
 
 
 
 
 
config.json DELETED
@@ -1,53 +0,0 @@
1
- {
2
- "architectures": [
3
- "VoxtralForConditionalGeneration"
4
- ],
5
- "audio_config": {
6
- "activation_dropout": 0.0,
7
- "activation_function": "gelu",
8
- "attention_dropout": 0.0,
9
- "dropout": 0.0,
10
- "head_dim": 64,
11
- "hidden_size": 1280,
12
- "initializer_range": 0.02,
13
- "intermediate_size": 5120,
14
- "layerdrop": 0.0,
15
- "max_source_positions": 1500,
16
- "model_type": "voxtral_encoder",
17
- "num_attention_heads": 20,
18
- "num_hidden_layers": 32,
19
- "num_key_value_heads": 20,
20
- "num_mel_bins": 128,
21
- "scale_embedding": false,
22
- "vocab_size": 51866
23
- },
24
- "audio_token_id": 24,
25
- "hidden_size": 5120,
26
- "model_type": "voxtral",
27
- "projector_hidden_act": "gelu",
28
- "text_config": {
29
- "attention_bias": false,
30
- "attention_dropout": 0.0,
31
- "head_dim": 128,
32
- "hidden_act": "silu",
33
- "hidden_size": 5120,
34
- "initializer_range": 0.02,
35
- "intermediate_size": 32768,
36
- "max_position_embeddings": 131072,
37
- "mlp_bias": false,
38
- "model_type": "llama",
39
- "num_attention_heads": 32,
40
- "num_hidden_layers": 40,
41
- "num_key_value_heads": 8,
42
- "pretraining_tp": 1,
43
- "rms_norm_eps": 1e-05,
44
- "rope_scaling": null,
45
- "rope_theta": 100000000.0,
46
- "sliding_window": null,
47
- "use_cache": true,
48
- "vocab_size": 131072
49
- },
50
- "torch_dtype": "bfloat16",
51
- "transformers_version": "4.54.0.dev0",
52
- "vocab_size": 131072
53
- }
 
 
 
 
generation_config.json DELETED
@@ -1,6 +0,0 @@
1
- {
2
- "bos_token_id": 1,
3
- "eos_token_id": 2,
4
- "pad_token_id": 11,
5
- "transformers_version": "4.54.0.dev0"
6
- }
 
 
model-00001-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:9e4c3fde4ef4a1141b4b06f2125c63bb2032f71a7985fafb4b3ae8243ca17be2
3
- size 4947893992
 
 
 
 
model-00002-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:6998429899fc08e533b7ef0bb65a66d0c744a490c31b5210ea5f61e62e8fb997
3
- size 4781593336
 
 
 
 
model-00003-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b692e60c5d9acfd5b6e7fd2c0337349a421bb687c8cef14627469f744efa83df
3
- size 4781593344
 
 
 
 
model-00004-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:0c34b0b04a59f62c3d1f320803f421c45d51dbf917f4e3f90eda957640ae85da
3
- size 4886472248
 
 
 
 
model-00005-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:bcea8514dd1330b2478deeff5b48b6788c1e6fbd1c6d5034faeeee5c066b6288
3
- size 4781593376
 
 
 
 
model-00006-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:3e415b6a5cc6f266acdd8aeba526288abd8345ebaa4022ab0e26ea30eaaffa91
3
- size 4781593368
 
 
 
 
model-00007-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:720bbe6314e0b69b97fd13e07ffdef4953dc5aec3b4b91c15a9c1d40d7b1c7fc
3
- size 4886472248
 
 
 
 
model-00008-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:85c590b7d4d6c843dc083b3048f5b00cda7ee7ddfeb18a3732ff2db65926a6eb
3
- size 4781593376
 
 
 
 
model-00009-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:fc973532e7f8faea778ed22aac9867b84e1d4bb4ad9c9a50744fc1eeeb6c9115
3
- size 4781593368
 
 
 
 
model-00010-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:365913969c7a5c93cc2d9e19b6f9d4128e675db8a1941c2436f0a2a2c64870e2
3
- size 3670112232
 
 
 
 
model-00011-of-00011.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:2515d18911999b503000115208d6c5ae46777223227c18a170cbff3095aaf697
3
- size 1447035256
 
 
 
 
model.safetensors.index.json DELETED
@@ -1,860 +0,0 @@
1
- {
2
- "metadata": {
3
- "total_parameters": 24261800960,
4
- "total_size": 48527441920
5
- },
6
- "weight_map": {
7
- "audio_tower.conv1.bias": "model-00001-of-00011.safetensors",
8
- "audio_tower.conv1.weight": "model-00001-of-00011.safetensors",
9
- "audio_tower.conv2.bias": "model-00001-of-00011.safetensors",
10
- "audio_tower.conv2.weight": "model-00001-of-00011.safetensors",
11
- "audio_tower.embed_positions.weight": "model-00001-of-00011.safetensors",
12
- "audio_tower.layer_norm.bias": "model-00001-of-00011.safetensors",
13
- "audio_tower.layer_norm.weight": "model-00001-of-00011.safetensors",
14
- "audio_tower.layers.0.fc1.bias": "model-00001-of-00011.safetensors",
15
- "audio_tower.layers.0.fc1.weight": "model-00001-of-00011.safetensors",
16
- "audio_tower.layers.0.fc2.bias": "model-00001-of-00011.safetensors",
17
- "audio_tower.layers.0.fc2.weight": "model-00001-of-00011.safetensors",
18
- "audio_tower.layers.0.final_layer_norm.bias": "model-00001-of-00011.safetensors",
19
- "audio_tower.layers.0.final_layer_norm.weight": "model-00001-of-00011.safetensors",
20
- "audio_tower.layers.0.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
21
- "audio_tower.layers.0.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
22
- "audio_tower.layers.0.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
23
- "audio_tower.layers.0.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
24
- "audio_tower.layers.0.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
25
- "audio_tower.layers.0.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
26
- "audio_tower.layers.0.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
27
- "audio_tower.layers.0.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
28
- "audio_tower.layers.0.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
29
- "audio_tower.layers.1.fc1.bias": "model-00001-of-00011.safetensors",
30
- "audio_tower.layers.1.fc1.weight": "model-00001-of-00011.safetensors",
31
- "audio_tower.layers.1.fc2.bias": "model-00001-of-00011.safetensors",
32
- "audio_tower.layers.1.fc2.weight": "model-00001-of-00011.safetensors",
33
- "audio_tower.layers.1.final_layer_norm.bias": "model-00001-of-00011.safetensors",
34
- "audio_tower.layers.1.final_layer_norm.weight": "model-00001-of-00011.safetensors",
35
- "audio_tower.layers.1.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
36
- "audio_tower.layers.1.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
37
- "audio_tower.layers.1.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
38
- "audio_tower.layers.1.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
39
- "audio_tower.layers.1.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
40
- "audio_tower.layers.1.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
41
- "audio_tower.layers.1.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
42
- "audio_tower.layers.1.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
43
- "audio_tower.layers.1.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
44
- "audio_tower.layers.10.fc1.bias": "model-00001-of-00011.safetensors",
45
- "audio_tower.layers.10.fc1.weight": "model-00001-of-00011.safetensors",
46
- "audio_tower.layers.10.fc2.bias": "model-00001-of-00011.safetensors",
47
- "audio_tower.layers.10.fc2.weight": "model-00001-of-00011.safetensors",
48
- "audio_tower.layers.10.final_layer_norm.bias": "model-00001-of-00011.safetensors",
49
- "audio_tower.layers.10.final_layer_norm.weight": "model-00001-of-00011.safetensors",
50
- "audio_tower.layers.10.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
51
- "audio_tower.layers.10.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
52
- "audio_tower.layers.10.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
53
- "audio_tower.layers.10.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
54
- "audio_tower.layers.10.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
55
- "audio_tower.layers.10.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
56
- "audio_tower.layers.10.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
57
- "audio_tower.layers.10.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
58
- "audio_tower.layers.10.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
59
- "audio_tower.layers.11.fc1.bias": "model-00001-of-00011.safetensors",
60
- "audio_tower.layers.11.fc1.weight": "model-00001-of-00011.safetensors",
61
- "audio_tower.layers.11.fc2.bias": "model-00001-of-00011.safetensors",
62
- "audio_tower.layers.11.fc2.weight": "model-00001-of-00011.safetensors",
63
- "audio_tower.layers.11.final_layer_norm.bias": "model-00001-of-00011.safetensors",
64
- "audio_tower.layers.11.final_layer_norm.weight": "model-00001-of-00011.safetensors",
65
- "audio_tower.layers.11.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
66
- "audio_tower.layers.11.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
67
- "audio_tower.layers.11.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
68
- "audio_tower.layers.11.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
69
- "audio_tower.layers.11.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
70
- "audio_tower.layers.11.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
71
- "audio_tower.layers.11.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
72
- "audio_tower.layers.11.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
73
- "audio_tower.layers.11.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
74
- "audio_tower.layers.12.fc1.bias": "model-00001-of-00011.safetensors",
75
- "audio_tower.layers.12.fc1.weight": "model-00001-of-00011.safetensors",
76
- "audio_tower.layers.12.fc2.bias": "model-00001-of-00011.safetensors",
77
- "audio_tower.layers.12.fc2.weight": "model-00001-of-00011.safetensors",
78
- "audio_tower.layers.12.final_layer_norm.bias": "model-00001-of-00011.safetensors",
79
- "audio_tower.layers.12.final_layer_norm.weight": "model-00001-of-00011.safetensors",
80
- "audio_tower.layers.12.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
81
- "audio_tower.layers.12.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
82
- "audio_tower.layers.12.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
83
- "audio_tower.layers.12.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
84
- "audio_tower.layers.12.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
85
- "audio_tower.layers.12.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
86
- "audio_tower.layers.12.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
87
- "audio_tower.layers.12.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
88
- "audio_tower.layers.12.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
89
- "audio_tower.layers.13.fc1.bias": "model-00001-of-00011.safetensors",
90
- "audio_tower.layers.13.fc1.weight": "model-00001-of-00011.safetensors",
91
- "audio_tower.layers.13.fc2.bias": "model-00001-of-00011.safetensors",
92
- "audio_tower.layers.13.fc2.weight": "model-00001-of-00011.safetensors",
93
- "audio_tower.layers.13.final_layer_norm.bias": "model-00001-of-00011.safetensors",
94
- "audio_tower.layers.13.final_layer_norm.weight": "model-00001-of-00011.safetensors",
95
- "audio_tower.layers.13.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
96
- "audio_tower.layers.13.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
97
- "audio_tower.layers.13.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
98
- "audio_tower.layers.13.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
99
- "audio_tower.layers.13.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
100
- "audio_tower.layers.13.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
101
- "audio_tower.layers.13.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
102
- "audio_tower.layers.13.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
103
- "audio_tower.layers.13.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
104
- "audio_tower.layers.14.fc1.bias": "model-00001-of-00011.safetensors",
105
- "audio_tower.layers.14.fc1.weight": "model-00001-of-00011.safetensors",
106
- "audio_tower.layers.14.fc2.bias": "model-00001-of-00011.safetensors",
107
- "audio_tower.layers.14.fc2.weight": "model-00001-of-00011.safetensors",
108
- "audio_tower.layers.14.final_layer_norm.bias": "model-00001-of-00011.safetensors",
109
- "audio_tower.layers.14.final_layer_norm.weight": "model-00001-of-00011.safetensors",
110
- "audio_tower.layers.14.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
111
- "audio_tower.layers.14.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
112
- "audio_tower.layers.14.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
113
- "audio_tower.layers.14.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
114
- "audio_tower.layers.14.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
115
- "audio_tower.layers.14.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
116
- "audio_tower.layers.14.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
117
- "audio_tower.layers.14.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
118
- "audio_tower.layers.14.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
119
- "audio_tower.layers.15.fc1.bias": "model-00001-of-00011.safetensors",
120
- "audio_tower.layers.15.fc1.weight": "model-00001-of-00011.safetensors",
121
- "audio_tower.layers.15.fc2.bias": "model-00001-of-00011.safetensors",
122
- "audio_tower.layers.15.fc2.weight": "model-00001-of-00011.safetensors",
123
- "audio_tower.layers.15.final_layer_norm.bias": "model-00001-of-00011.safetensors",
124
- "audio_tower.layers.15.final_layer_norm.weight": "model-00001-of-00011.safetensors",
125
- "audio_tower.layers.15.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
126
- "audio_tower.layers.15.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
127
- "audio_tower.layers.15.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
128
- "audio_tower.layers.15.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
129
- "audio_tower.layers.15.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
130
- "audio_tower.layers.15.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
131
- "audio_tower.layers.15.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
132
- "audio_tower.layers.15.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
133
- "audio_tower.layers.15.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
134
- "audio_tower.layers.16.fc1.bias": "model-00001-of-00011.safetensors",
135
- "audio_tower.layers.16.fc1.weight": "model-00001-of-00011.safetensors",
136
- "audio_tower.layers.16.fc2.bias": "model-00001-of-00011.safetensors",
137
- "audio_tower.layers.16.fc2.weight": "model-00001-of-00011.safetensors",
138
- "audio_tower.layers.16.final_layer_norm.bias": "model-00001-of-00011.safetensors",
139
- "audio_tower.layers.16.final_layer_norm.weight": "model-00001-of-00011.safetensors",
140
- "audio_tower.layers.16.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
141
- "audio_tower.layers.16.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
142
- "audio_tower.layers.16.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
143
- "audio_tower.layers.16.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
144
- "audio_tower.layers.16.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
145
- "audio_tower.layers.16.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
146
- "audio_tower.layers.16.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
147
- "audio_tower.layers.16.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
148
- "audio_tower.layers.16.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
149
- "audio_tower.layers.17.fc1.bias": "model-00001-of-00011.safetensors",
150
- "audio_tower.layers.17.fc1.weight": "model-00001-of-00011.safetensors",
151
- "audio_tower.layers.17.fc2.bias": "model-00001-of-00011.safetensors",
152
- "audio_tower.layers.17.fc2.weight": "model-00001-of-00011.safetensors",
153
- "audio_tower.layers.17.final_layer_norm.bias": "model-00001-of-00011.safetensors",
154
- "audio_tower.layers.17.final_layer_norm.weight": "model-00001-of-00011.safetensors",
155
- "audio_tower.layers.17.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
156
- "audio_tower.layers.17.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
157
- "audio_tower.layers.17.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
158
- "audio_tower.layers.17.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
159
- "audio_tower.layers.17.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
160
- "audio_tower.layers.17.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
161
- "audio_tower.layers.17.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
162
- "audio_tower.layers.17.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
163
- "audio_tower.layers.17.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
164
- "audio_tower.layers.18.fc1.bias": "model-00001-of-00011.safetensors",
165
- "audio_tower.layers.18.fc1.weight": "model-00001-of-00011.safetensors",
166
- "audio_tower.layers.18.fc2.bias": "model-00001-of-00011.safetensors",
167
- "audio_tower.layers.18.fc2.weight": "model-00001-of-00011.safetensors",
168
- "audio_tower.layers.18.final_layer_norm.bias": "model-00001-of-00011.safetensors",
169
- "audio_tower.layers.18.final_layer_norm.weight": "model-00001-of-00011.safetensors",
170
- "audio_tower.layers.18.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
171
- "audio_tower.layers.18.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
172
- "audio_tower.layers.18.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
173
- "audio_tower.layers.18.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
174
- "audio_tower.layers.18.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
175
- "audio_tower.layers.18.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
176
- "audio_tower.layers.18.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
177
- "audio_tower.layers.18.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
178
- "audio_tower.layers.18.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
179
- "audio_tower.layers.19.fc1.bias": "model-00001-of-00011.safetensors",
180
- "audio_tower.layers.19.fc1.weight": "model-00001-of-00011.safetensors",
181
- "audio_tower.layers.19.fc2.bias": "model-00001-of-00011.safetensors",
182
- "audio_tower.layers.19.fc2.weight": "model-00001-of-00011.safetensors",
183
- "audio_tower.layers.19.final_layer_norm.bias": "model-00001-of-00011.safetensors",
184
- "audio_tower.layers.19.final_layer_norm.weight": "model-00001-of-00011.safetensors",
185
- "audio_tower.layers.19.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
186
- "audio_tower.layers.19.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
187
- "audio_tower.layers.19.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
188
- "audio_tower.layers.19.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
189
- "audio_tower.layers.19.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
190
- "audio_tower.layers.19.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
191
- "audio_tower.layers.19.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
192
- "audio_tower.layers.19.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
193
- "audio_tower.layers.19.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
194
- "audio_tower.layers.2.fc1.bias": "model-00001-of-00011.safetensors",
195
- "audio_tower.layers.2.fc1.weight": "model-00001-of-00011.safetensors",
196
- "audio_tower.layers.2.fc2.bias": "model-00001-of-00011.safetensors",
197
- "audio_tower.layers.2.fc2.weight": "model-00001-of-00011.safetensors",
198
- "audio_tower.layers.2.final_layer_norm.bias": "model-00001-of-00011.safetensors",
199
- "audio_tower.layers.2.final_layer_norm.weight": "model-00001-of-00011.safetensors",
200
- "audio_tower.layers.2.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
201
- "audio_tower.layers.2.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
202
- "audio_tower.layers.2.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
203
- "audio_tower.layers.2.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
204
- "audio_tower.layers.2.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
205
- "audio_tower.layers.2.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
206
- "audio_tower.layers.2.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
207
- "audio_tower.layers.2.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
208
- "audio_tower.layers.2.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
209
- "audio_tower.layers.20.fc1.bias": "model-00001-of-00011.safetensors",
210
- "audio_tower.layers.20.fc1.weight": "model-00001-of-00011.safetensors",
211
- "audio_tower.layers.20.fc2.bias": "model-00001-of-00011.safetensors",
212
- "audio_tower.layers.20.fc2.weight": "model-00001-of-00011.safetensors",
213
- "audio_tower.layers.20.final_layer_norm.bias": "model-00001-of-00011.safetensors",
214
- "audio_tower.layers.20.final_layer_norm.weight": "model-00001-of-00011.safetensors",
215
- "audio_tower.layers.20.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
216
- "audio_tower.layers.20.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
217
- "audio_tower.layers.20.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
218
- "audio_tower.layers.20.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
219
- "audio_tower.layers.20.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
220
- "audio_tower.layers.20.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
221
- "audio_tower.layers.20.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
222
- "audio_tower.layers.20.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
223
- "audio_tower.layers.20.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
224
- "audio_tower.layers.21.fc1.bias": "model-00001-of-00011.safetensors",
225
- "audio_tower.layers.21.fc1.weight": "model-00001-of-00011.safetensors",
226
- "audio_tower.layers.21.fc2.bias": "model-00001-of-00011.safetensors",
227
- "audio_tower.layers.21.fc2.weight": "model-00001-of-00011.safetensors",
228
- "audio_tower.layers.21.final_layer_norm.bias": "model-00001-of-00011.safetensors",
229
- "audio_tower.layers.21.final_layer_norm.weight": "model-00001-of-00011.safetensors",
230
- "audio_tower.layers.21.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
231
- "audio_tower.layers.21.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
232
- "audio_tower.layers.21.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
233
- "audio_tower.layers.21.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
234
- "audio_tower.layers.21.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
235
- "audio_tower.layers.21.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
236
- "audio_tower.layers.21.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
237
- "audio_tower.layers.21.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
238
- "audio_tower.layers.21.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
239
- "audio_tower.layers.22.fc1.bias": "model-00001-of-00011.safetensors",
240
- "audio_tower.layers.22.fc1.weight": "model-00001-of-00011.safetensors",
241
- "audio_tower.layers.22.fc2.bias": "model-00001-of-00011.safetensors",
242
- "audio_tower.layers.22.fc2.weight": "model-00001-of-00011.safetensors",
243
- "audio_tower.layers.22.final_layer_norm.bias": "model-00001-of-00011.safetensors",
244
- "audio_tower.layers.22.final_layer_norm.weight": "model-00001-of-00011.safetensors",
245
- "audio_tower.layers.22.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
246
- "audio_tower.layers.22.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
247
- "audio_tower.layers.22.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
248
- "audio_tower.layers.22.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
249
- "audio_tower.layers.22.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
250
- "audio_tower.layers.22.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
251
- "audio_tower.layers.22.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
252
- "audio_tower.layers.22.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
253
- "audio_tower.layers.22.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
254
- "audio_tower.layers.23.fc1.bias": "model-00001-of-00011.safetensors",
255
- "audio_tower.layers.23.fc1.weight": "model-00001-of-00011.safetensors",
256
- "audio_tower.layers.23.fc2.bias": "model-00001-of-00011.safetensors",
257
- "audio_tower.layers.23.fc2.weight": "model-00001-of-00011.safetensors",
258
- "audio_tower.layers.23.final_layer_norm.bias": "model-00001-of-00011.safetensors",
259
- "audio_tower.layers.23.final_layer_norm.weight": "model-00001-of-00011.safetensors",
260
- "audio_tower.layers.23.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
261
- "audio_tower.layers.23.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
262
- "audio_tower.layers.23.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
263
- "audio_tower.layers.23.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
264
- "audio_tower.layers.23.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
265
- "audio_tower.layers.23.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
266
- "audio_tower.layers.23.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
267
- "audio_tower.layers.23.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
268
- "audio_tower.layers.23.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
269
- "audio_tower.layers.24.fc1.bias": "model-00001-of-00011.safetensors",
270
- "audio_tower.layers.24.fc1.weight": "model-00001-of-00011.safetensors",
271
- "audio_tower.layers.24.fc2.bias": "model-00001-of-00011.safetensors",
272
- "audio_tower.layers.24.fc2.weight": "model-00001-of-00011.safetensors",
273
- "audio_tower.layers.24.final_layer_norm.bias": "model-00001-of-00011.safetensors",
274
- "audio_tower.layers.24.final_layer_norm.weight": "model-00001-of-00011.safetensors",
275
- "audio_tower.layers.24.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
276
- "audio_tower.layers.24.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
277
- "audio_tower.layers.24.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
278
- "audio_tower.layers.24.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
279
- "audio_tower.layers.24.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
280
- "audio_tower.layers.24.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
281
- "audio_tower.layers.24.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
282
- "audio_tower.layers.24.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
283
- "audio_tower.layers.24.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
284
- "audio_tower.layers.25.fc1.bias": "model-00001-of-00011.safetensors",
285
- "audio_tower.layers.25.fc1.weight": "model-00001-of-00011.safetensors",
286
- "audio_tower.layers.25.fc2.bias": "model-00001-of-00011.safetensors",
287
- "audio_tower.layers.25.fc2.weight": "model-00001-of-00011.safetensors",
288
- "audio_tower.layers.25.final_layer_norm.bias": "model-00001-of-00011.safetensors",
289
- "audio_tower.layers.25.final_layer_norm.weight": "model-00001-of-00011.safetensors",
290
- "audio_tower.layers.25.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
291
- "audio_tower.layers.25.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
292
- "audio_tower.layers.25.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
293
- "audio_tower.layers.25.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
294
- "audio_tower.layers.25.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
295
- "audio_tower.layers.25.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
296
- "audio_tower.layers.25.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
297
- "audio_tower.layers.25.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
298
- "audio_tower.layers.25.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
299
- "audio_tower.layers.26.fc1.bias": "model-00001-of-00011.safetensors",
300
- "audio_tower.layers.26.fc1.weight": "model-00001-of-00011.safetensors",
301
- "audio_tower.layers.26.fc2.bias": "model-00001-of-00011.safetensors",
302
- "audio_tower.layers.26.fc2.weight": "model-00001-of-00011.safetensors",
303
- "audio_tower.layers.26.final_layer_norm.bias": "model-00001-of-00011.safetensors",
304
- "audio_tower.layers.26.final_layer_norm.weight": "model-00001-of-00011.safetensors",
305
- "audio_tower.layers.26.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
306
- "audio_tower.layers.26.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
307
- "audio_tower.layers.26.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
308
- "audio_tower.layers.26.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
309
- "audio_tower.layers.26.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
310
- "audio_tower.layers.26.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
311
- "audio_tower.layers.26.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
312
- "audio_tower.layers.26.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
313
- "audio_tower.layers.26.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
314
- "audio_tower.layers.27.fc1.bias": "model-00001-of-00011.safetensors",
315
- "audio_tower.layers.27.fc1.weight": "model-00001-of-00011.safetensors",
316
- "audio_tower.layers.27.fc2.bias": "model-00001-of-00011.safetensors",
317
- "audio_tower.layers.27.fc2.weight": "model-00001-of-00011.safetensors",
318
- "audio_tower.layers.27.final_layer_norm.bias": "model-00001-of-00011.safetensors",
319
- "audio_tower.layers.27.final_layer_norm.weight": "model-00001-of-00011.safetensors",
320
- "audio_tower.layers.27.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
321
- "audio_tower.layers.27.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
322
- "audio_tower.layers.27.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
323
- "audio_tower.layers.27.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
324
- "audio_tower.layers.27.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
325
- "audio_tower.layers.27.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
326
- "audio_tower.layers.27.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
327
- "audio_tower.layers.27.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
328
- "audio_tower.layers.27.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
329
- "audio_tower.layers.28.fc1.bias": "model-00001-of-00011.safetensors",
330
- "audio_tower.layers.28.fc1.weight": "model-00001-of-00011.safetensors",
331
- "audio_tower.layers.28.fc2.bias": "model-00001-of-00011.safetensors",
332
- "audio_tower.layers.28.fc2.weight": "model-00001-of-00011.safetensors",
333
- "audio_tower.layers.28.final_layer_norm.bias": "model-00001-of-00011.safetensors",
334
- "audio_tower.layers.28.final_layer_norm.weight": "model-00001-of-00011.safetensors",
335
- "audio_tower.layers.28.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
336
- "audio_tower.layers.28.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
337
- "audio_tower.layers.28.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
338
- "audio_tower.layers.28.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
339
- "audio_tower.layers.28.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
340
- "audio_tower.layers.28.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
341
- "audio_tower.layers.28.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
342
- "audio_tower.layers.28.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
343
- "audio_tower.layers.28.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
344
- "audio_tower.layers.29.fc1.bias": "model-00001-of-00011.safetensors",
345
- "audio_tower.layers.29.fc1.weight": "model-00001-of-00011.safetensors",
346
- "audio_tower.layers.29.fc2.bias": "model-00001-of-00011.safetensors",
347
- "audio_tower.layers.29.fc2.weight": "model-00001-of-00011.safetensors",
348
- "audio_tower.layers.29.final_layer_norm.bias": "model-00001-of-00011.safetensors",
349
- "audio_tower.layers.29.final_layer_norm.weight": "model-00001-of-00011.safetensors",
350
- "audio_tower.layers.29.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
351
- "audio_tower.layers.29.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
352
- "audio_tower.layers.29.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
353
- "audio_tower.layers.29.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
354
- "audio_tower.layers.29.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
355
- "audio_tower.layers.29.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
356
- "audio_tower.layers.29.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
357
- "audio_tower.layers.29.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
358
- "audio_tower.layers.29.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
359
- "audio_tower.layers.3.fc1.bias": "model-00001-of-00011.safetensors",
360
- "audio_tower.layers.3.fc1.weight": "model-00001-of-00011.safetensors",
361
- "audio_tower.layers.3.fc2.bias": "model-00001-of-00011.safetensors",
362
- "audio_tower.layers.3.fc2.weight": "model-00001-of-00011.safetensors",
363
- "audio_tower.layers.3.final_layer_norm.bias": "model-00001-of-00011.safetensors",
364
- "audio_tower.layers.3.final_layer_norm.weight": "model-00001-of-00011.safetensors",
365
- "audio_tower.layers.3.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
366
- "audio_tower.layers.3.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
367
- "audio_tower.layers.3.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
368
- "audio_tower.layers.3.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
369
- "audio_tower.layers.3.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
370
- "audio_tower.layers.3.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
371
- "audio_tower.layers.3.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
372
- "audio_tower.layers.3.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
373
- "audio_tower.layers.3.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
374
- "audio_tower.layers.30.fc1.bias": "model-00001-of-00011.safetensors",
375
- "audio_tower.layers.30.fc1.weight": "model-00001-of-00011.safetensors",
376
- "audio_tower.layers.30.fc2.bias": "model-00001-of-00011.safetensors",
377
- "audio_tower.layers.30.fc2.weight": "model-00001-of-00011.safetensors",
378
- "audio_tower.layers.30.final_layer_norm.bias": "model-00001-of-00011.safetensors",
379
- "audio_tower.layers.30.final_layer_norm.weight": "model-00001-of-00011.safetensors",
380
- "audio_tower.layers.30.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
381
- "audio_tower.layers.30.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
382
- "audio_tower.layers.30.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
383
- "audio_tower.layers.30.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
384
- "audio_tower.layers.30.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
385
- "audio_tower.layers.30.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
386
- "audio_tower.layers.30.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
387
- "audio_tower.layers.30.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
388
- "audio_tower.layers.30.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
389
- "audio_tower.layers.31.fc1.bias": "model-00001-of-00011.safetensors",
390
- "audio_tower.layers.31.fc1.weight": "model-00001-of-00011.safetensors",
391
- "audio_tower.layers.31.fc2.bias": "model-00001-of-00011.safetensors",
392
- "audio_tower.layers.31.fc2.weight": "model-00001-of-00011.safetensors",
393
- "audio_tower.layers.31.final_layer_norm.bias": "model-00001-of-00011.safetensors",
394
- "audio_tower.layers.31.final_layer_norm.weight": "model-00001-of-00011.safetensors",
395
- "audio_tower.layers.31.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
396
- "audio_tower.layers.31.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
397
- "audio_tower.layers.31.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
398
- "audio_tower.layers.31.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
399
- "audio_tower.layers.31.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
400
- "audio_tower.layers.31.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
401
- "audio_tower.layers.31.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
402
- "audio_tower.layers.31.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
403
- "audio_tower.layers.31.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
404
- "audio_tower.layers.4.fc1.bias": "model-00001-of-00011.safetensors",
405
- "audio_tower.layers.4.fc1.weight": "model-00001-of-00011.safetensors",
406
- "audio_tower.layers.4.fc2.bias": "model-00001-of-00011.safetensors",
407
- "audio_tower.layers.4.fc2.weight": "model-00001-of-00011.safetensors",
408
- "audio_tower.layers.4.final_layer_norm.bias": "model-00001-of-00011.safetensors",
409
- "audio_tower.layers.4.final_layer_norm.weight": "model-00001-of-00011.safetensors",
410
- "audio_tower.layers.4.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
411
- "audio_tower.layers.4.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
412
- "audio_tower.layers.4.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
413
- "audio_tower.layers.4.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
414
- "audio_tower.layers.4.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
415
- "audio_tower.layers.4.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
416
- "audio_tower.layers.4.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
417
- "audio_tower.layers.4.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
418
- "audio_tower.layers.4.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
419
- "audio_tower.layers.5.fc1.bias": "model-00001-of-00011.safetensors",
420
- "audio_tower.layers.5.fc1.weight": "model-00001-of-00011.safetensors",
421
- "audio_tower.layers.5.fc2.bias": "model-00001-of-00011.safetensors",
422
- "audio_tower.layers.5.fc2.weight": "model-00001-of-00011.safetensors",
423
- "audio_tower.layers.5.final_layer_norm.bias": "model-00001-of-00011.safetensors",
424
- "audio_tower.layers.5.final_layer_norm.weight": "model-00001-of-00011.safetensors",
425
- "audio_tower.layers.5.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
426
- "audio_tower.layers.5.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
427
- "audio_tower.layers.5.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
428
- "audio_tower.layers.5.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
429
- "audio_tower.layers.5.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
430
- "audio_tower.layers.5.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
431
- "audio_tower.layers.5.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
432
- "audio_tower.layers.5.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
433
- "audio_tower.layers.5.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
434
- "audio_tower.layers.6.fc1.bias": "model-00001-of-00011.safetensors",
435
- "audio_tower.layers.6.fc1.weight": "model-00001-of-00011.safetensors",
436
- "audio_tower.layers.6.fc2.bias": "model-00001-of-00011.safetensors",
437
- "audio_tower.layers.6.fc2.weight": "model-00001-of-00011.safetensors",
438
- "audio_tower.layers.6.final_layer_norm.bias": "model-00001-of-00011.safetensors",
439
- "audio_tower.layers.6.final_layer_norm.weight": "model-00001-of-00011.safetensors",
440
- "audio_tower.layers.6.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
441
- "audio_tower.layers.6.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
442
- "audio_tower.layers.6.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
443
- "audio_tower.layers.6.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
444
- "audio_tower.layers.6.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
445
- "audio_tower.layers.6.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
446
- "audio_tower.layers.6.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
447
- "audio_tower.layers.6.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
448
- "audio_tower.layers.6.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
449
- "audio_tower.layers.7.fc1.bias": "model-00001-of-00011.safetensors",
450
- "audio_tower.layers.7.fc1.weight": "model-00001-of-00011.safetensors",
451
- "audio_tower.layers.7.fc2.bias": "model-00001-of-00011.safetensors",
452
- "audio_tower.layers.7.fc2.weight": "model-00001-of-00011.safetensors",
453
- "audio_tower.layers.7.final_layer_norm.bias": "model-00001-of-00011.safetensors",
454
- "audio_tower.layers.7.final_layer_norm.weight": "model-00001-of-00011.safetensors",
455
- "audio_tower.layers.7.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
456
- "audio_tower.layers.7.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
457
- "audio_tower.layers.7.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
458
- "audio_tower.layers.7.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
459
- "audio_tower.layers.7.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
460
- "audio_tower.layers.7.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
461
- "audio_tower.layers.7.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
462
- "audio_tower.layers.7.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
463
- "audio_tower.layers.7.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
464
- "audio_tower.layers.8.fc1.bias": "model-00001-of-00011.safetensors",
465
- "audio_tower.layers.8.fc1.weight": "model-00001-of-00011.safetensors",
466
- "audio_tower.layers.8.fc2.bias": "model-00001-of-00011.safetensors",
467
- "audio_tower.layers.8.fc2.weight": "model-00001-of-00011.safetensors",
468
- "audio_tower.layers.8.final_layer_norm.bias": "model-00001-of-00011.safetensors",
469
- "audio_tower.layers.8.final_layer_norm.weight": "model-00001-of-00011.safetensors",
470
- "audio_tower.layers.8.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
471
- "audio_tower.layers.8.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
472
- "audio_tower.layers.8.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
473
- "audio_tower.layers.8.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
474
- "audio_tower.layers.8.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
475
- "audio_tower.layers.8.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
476
- "audio_tower.layers.8.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
477
- "audio_tower.layers.8.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
478
- "audio_tower.layers.8.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
479
- "audio_tower.layers.9.fc1.bias": "model-00001-of-00011.safetensors",
480
- "audio_tower.layers.9.fc1.weight": "model-00001-of-00011.safetensors",
481
- "audio_tower.layers.9.fc2.bias": "model-00001-of-00011.safetensors",
482
- "audio_tower.layers.9.fc2.weight": "model-00001-of-00011.safetensors",
483
- "audio_tower.layers.9.final_layer_norm.bias": "model-00001-of-00011.safetensors",
484
- "audio_tower.layers.9.final_layer_norm.weight": "model-00001-of-00011.safetensors",
485
- "audio_tower.layers.9.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
486
- "audio_tower.layers.9.self_attn.out_proj.bias": "model-00001-of-00011.safetensors",
487
- "audio_tower.layers.9.self_attn.out_proj.weight": "model-00001-of-00011.safetensors",
488
- "audio_tower.layers.9.self_attn.q_proj.bias": "model-00001-of-00011.safetensors",
489
- "audio_tower.layers.9.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
490
- "audio_tower.layers.9.self_attn.v_proj.bias": "model-00001-of-00011.safetensors",
491
- "audio_tower.layers.9.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
492
- "audio_tower.layers.9.self_attn_layer_norm.bias": "model-00001-of-00011.safetensors",
493
- "audio_tower.layers.9.self_attn_layer_norm.weight": "model-00001-of-00011.safetensors",
494
- "language_model.lm_head.weight": "model-00011-of-00011.safetensors",
495
- "language_model.model.embed_tokens.weight": "model-00001-of-00011.safetensors",
496
- "language_model.model.layers.0.input_layernorm.weight": "model-00001-of-00011.safetensors",
497
- "language_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00011.safetensors",
498
- "language_model.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00011.safetensors",
499
- "language_model.model.layers.0.mlp.up_proj.weight": "model-00001-of-00011.safetensors",
500
- "language_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00011.safetensors",
501
- "language_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
502
- "language_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
503
- "language_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
504
- "language_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
505
- "language_model.model.layers.1.input_layernorm.weight": "model-00001-of-00011.safetensors",
506
- "language_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00011.safetensors",
507
- "language_model.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00011.safetensors",
508
- "language_model.model.layers.1.mlp.up_proj.weight": "model-00001-of-00011.safetensors",
509
- "language_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00011.safetensors",
510
- "language_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
511
- "language_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
512
- "language_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
513
- "language_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
514
- "language_model.model.layers.10.input_layernorm.weight": "model-00004-of-00011.safetensors",
515
- "language_model.model.layers.10.mlp.down_proj.weight": "model-00004-of-00011.safetensors",
516
- "language_model.model.layers.10.mlp.gate_proj.weight": "model-00003-of-00011.safetensors",
517
- "language_model.model.layers.10.mlp.up_proj.weight": "model-00003-of-00011.safetensors",
518
- "language_model.model.layers.10.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
519
- "language_model.model.layers.10.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
520
- "language_model.model.layers.10.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
521
- "language_model.model.layers.10.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
522
- "language_model.model.layers.10.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
523
- "language_model.model.layers.11.input_layernorm.weight": "model-00004-of-00011.safetensors",
524
- "language_model.model.layers.11.mlp.down_proj.weight": "model-00004-of-00011.safetensors",
525
- "language_model.model.layers.11.mlp.gate_proj.weight": "model-00004-of-00011.safetensors",
526
- "language_model.model.layers.11.mlp.up_proj.weight": "model-00004-of-00011.safetensors",
527
- "language_model.model.layers.11.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
528
- "language_model.model.layers.11.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
529
- "language_model.model.layers.11.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
530
- "language_model.model.layers.11.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
531
- "language_model.model.layers.11.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
532
- "language_model.model.layers.12.input_layernorm.weight": "model-00004-of-00011.safetensors",
533
- "language_model.model.layers.12.mlp.down_proj.weight": "model-00004-of-00011.safetensors",
534
- "language_model.model.layers.12.mlp.gate_proj.weight": "model-00004-of-00011.safetensors",
535
- "language_model.model.layers.12.mlp.up_proj.weight": "model-00004-of-00011.safetensors",
536
- "language_model.model.layers.12.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
537
- "language_model.model.layers.12.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
538
- "language_model.model.layers.12.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
539
- "language_model.model.layers.12.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
540
- "language_model.model.layers.12.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
541
- "language_model.model.layers.13.input_layernorm.weight": "model-00004-of-00011.safetensors",
542
- "language_model.model.layers.13.mlp.down_proj.weight": "model-00004-of-00011.safetensors",
543
- "language_model.model.layers.13.mlp.gate_proj.weight": "model-00004-of-00011.safetensors",
544
- "language_model.model.layers.13.mlp.up_proj.weight": "model-00004-of-00011.safetensors",
545
- "language_model.model.layers.13.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
546
- "language_model.model.layers.13.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
547
- "language_model.model.layers.13.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
548
- "language_model.model.layers.13.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
549
- "language_model.model.layers.13.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
550
- "language_model.model.layers.14.input_layernorm.weight": "model-00004-of-00011.safetensors",
551
- "language_model.model.layers.14.mlp.down_proj.weight": "model-00004-of-00011.safetensors",
552
- "language_model.model.layers.14.mlp.gate_proj.weight": "model-00004-of-00011.safetensors",
553
- "language_model.model.layers.14.mlp.up_proj.weight": "model-00004-of-00011.safetensors",
554
- "language_model.model.layers.14.post_attention_layernorm.weight": "model-00004-of-00011.safetensors",
555
- "language_model.model.layers.14.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
556
- "language_model.model.layers.14.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
557
- "language_model.model.layers.14.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
558
- "language_model.model.layers.14.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
559
- "language_model.model.layers.15.input_layernorm.weight": "model-00005-of-00011.safetensors",
560
- "language_model.model.layers.15.mlp.down_proj.weight": "model-00005-of-00011.safetensors",
561
- "language_model.model.layers.15.mlp.gate_proj.weight": "model-00005-of-00011.safetensors",
562
- "language_model.model.layers.15.mlp.up_proj.weight": "model-00005-of-00011.safetensors",
563
- "language_model.model.layers.15.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
564
- "language_model.model.layers.15.self_attn.k_proj.weight": "model-00004-of-00011.safetensors",
565
- "language_model.model.layers.15.self_attn.o_proj.weight": "model-00004-of-00011.safetensors",
566
- "language_model.model.layers.15.self_attn.q_proj.weight": "model-00004-of-00011.safetensors",
567
- "language_model.model.layers.15.self_attn.v_proj.weight": "model-00004-of-00011.safetensors",
568
- "language_model.model.layers.16.input_layernorm.weight": "model-00005-of-00011.safetensors",
569
- "language_model.model.layers.16.mlp.down_proj.weight": "model-00005-of-00011.safetensors",
570
- "language_model.model.layers.16.mlp.gate_proj.weight": "model-00005-of-00011.safetensors",
571
- "language_model.model.layers.16.mlp.up_proj.weight": "model-00005-of-00011.safetensors",
572
- "language_model.model.layers.16.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
573
- "language_model.model.layers.16.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
574
- "language_model.model.layers.16.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
575
- "language_model.model.layers.16.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
576
- "language_model.model.layers.16.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
577
- "language_model.model.layers.17.input_layernorm.weight": "model-00005-of-00011.safetensors",
578
- "language_model.model.layers.17.mlp.down_proj.weight": "model-00005-of-00011.safetensors",
579
- "language_model.model.layers.17.mlp.gate_proj.weight": "model-00005-of-00011.safetensors",
580
- "language_model.model.layers.17.mlp.up_proj.weight": "model-00005-of-00011.safetensors",
581
- "language_model.model.layers.17.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
582
- "language_model.model.layers.17.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
583
- "language_model.model.layers.17.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
584
- "language_model.model.layers.17.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
585
- "language_model.model.layers.17.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
586
- "language_model.model.layers.18.input_layernorm.weight": "model-00005-of-00011.safetensors",
587
- "language_model.model.layers.18.mlp.down_proj.weight": "model-00005-of-00011.safetensors",
588
- "language_model.model.layers.18.mlp.gate_proj.weight": "model-00005-of-00011.safetensors",
589
- "language_model.model.layers.18.mlp.up_proj.weight": "model-00005-of-00011.safetensors",
590
- "language_model.model.layers.18.post_attention_layernorm.weight": "model-00005-of-00011.safetensors",
591
- "language_model.model.layers.18.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
592
- "language_model.model.layers.18.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
593
- "language_model.model.layers.18.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
594
- "language_model.model.layers.18.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
595
- "language_model.model.layers.19.input_layernorm.weight": "model-00006-of-00011.safetensors",
596
- "language_model.model.layers.19.mlp.down_proj.weight": "model-00006-of-00011.safetensors",
597
- "language_model.model.layers.19.mlp.gate_proj.weight": "model-00005-of-00011.safetensors",
598
- "language_model.model.layers.19.mlp.up_proj.weight": "model-00006-of-00011.safetensors",
599
- "language_model.model.layers.19.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
600
- "language_model.model.layers.19.self_attn.k_proj.weight": "model-00005-of-00011.safetensors",
601
- "language_model.model.layers.19.self_attn.o_proj.weight": "model-00005-of-00011.safetensors",
602
- "language_model.model.layers.19.self_attn.q_proj.weight": "model-00005-of-00011.safetensors",
603
- "language_model.model.layers.19.self_attn.v_proj.weight": "model-00005-of-00011.safetensors",
604
- "language_model.model.layers.2.input_layernorm.weight": "model-00002-of-00011.safetensors",
605
- "language_model.model.layers.2.mlp.down_proj.weight": "model-00002-of-00011.safetensors",
606
- "language_model.model.layers.2.mlp.gate_proj.weight": "model-00002-of-00011.safetensors",
607
- "language_model.model.layers.2.mlp.up_proj.weight": "model-00002-of-00011.safetensors",
608
- "language_model.model.layers.2.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
609
- "language_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00011.safetensors",
610
- "language_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00011.safetensors",
611
- "language_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00011.safetensors",
612
- "language_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00011.safetensors",
613
- "language_model.model.layers.20.input_layernorm.weight": "model-00006-of-00011.safetensors",
614
- "language_model.model.layers.20.mlp.down_proj.weight": "model-00006-of-00011.safetensors",
615
- "language_model.model.layers.20.mlp.gate_proj.weight": "model-00006-of-00011.safetensors",
616
- "language_model.model.layers.20.mlp.up_proj.weight": "model-00006-of-00011.safetensors",
617
- "language_model.model.layers.20.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
618
- "language_model.model.layers.20.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
619
- "language_model.model.layers.20.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
620
- "language_model.model.layers.20.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
621
- "language_model.model.layers.20.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
622
- "language_model.model.layers.21.input_layernorm.weight": "model-00006-of-00011.safetensors",
623
- "language_model.model.layers.21.mlp.down_proj.weight": "model-00006-of-00011.safetensors",
624
- "language_model.model.layers.21.mlp.gate_proj.weight": "model-00006-of-00011.safetensors",
625
- "language_model.model.layers.21.mlp.up_proj.weight": "model-00006-of-00011.safetensors",
626
- "language_model.model.layers.21.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
627
- "language_model.model.layers.21.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
628
- "language_model.model.layers.21.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
629
- "language_model.model.layers.21.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
630
- "language_model.model.layers.21.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
631
- "language_model.model.layers.22.input_layernorm.weight": "model-00006-of-00011.safetensors",
632
- "language_model.model.layers.22.mlp.down_proj.weight": "model-00006-of-00011.safetensors",
633
- "language_model.model.layers.22.mlp.gate_proj.weight": "model-00006-of-00011.safetensors",
634
- "language_model.model.layers.22.mlp.up_proj.weight": "model-00006-of-00011.safetensors",
635
- "language_model.model.layers.22.post_attention_layernorm.weight": "model-00006-of-00011.safetensors",
636
- "language_model.model.layers.22.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
637
- "language_model.model.layers.22.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
638
- "language_model.model.layers.22.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
639
- "language_model.model.layers.22.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
640
- "language_model.model.layers.23.input_layernorm.weight": "model-00007-of-00011.safetensors",
641
- "language_model.model.layers.23.mlp.down_proj.weight": "model-00007-of-00011.safetensors",
642
- "language_model.model.layers.23.mlp.gate_proj.weight": "model-00006-of-00011.safetensors",
643
- "language_model.model.layers.23.mlp.up_proj.weight": "model-00006-of-00011.safetensors",
644
- "language_model.model.layers.23.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
645
- "language_model.model.layers.23.self_attn.k_proj.weight": "model-00006-of-00011.safetensors",
646
- "language_model.model.layers.23.self_attn.o_proj.weight": "model-00006-of-00011.safetensors",
647
- "language_model.model.layers.23.self_attn.q_proj.weight": "model-00006-of-00011.safetensors",
648
- "language_model.model.layers.23.self_attn.v_proj.weight": "model-00006-of-00011.safetensors",
649
- "language_model.model.layers.24.input_layernorm.weight": "model-00007-of-00011.safetensors",
650
- "language_model.model.layers.24.mlp.down_proj.weight": "model-00007-of-00011.safetensors",
651
- "language_model.model.layers.24.mlp.gate_proj.weight": "model-00007-of-00011.safetensors",
652
- "language_model.model.layers.24.mlp.up_proj.weight": "model-00007-of-00011.safetensors",
653
- "language_model.model.layers.24.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
654
- "language_model.model.layers.24.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
655
- "language_model.model.layers.24.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
656
- "language_model.model.layers.24.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
657
- "language_model.model.layers.24.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
658
- "language_model.model.layers.25.input_layernorm.weight": "model-00007-of-00011.safetensors",
659
- "language_model.model.layers.25.mlp.down_proj.weight": "model-00007-of-00011.safetensors",
660
- "language_model.model.layers.25.mlp.gate_proj.weight": "model-00007-of-00011.safetensors",
661
- "language_model.model.layers.25.mlp.up_proj.weight": "model-00007-of-00011.safetensors",
662
- "language_model.model.layers.25.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
663
- "language_model.model.layers.25.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
664
- "language_model.model.layers.25.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
665
- "language_model.model.layers.25.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
666
- "language_model.model.layers.25.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
667
- "language_model.model.layers.26.input_layernorm.weight": "model-00007-of-00011.safetensors",
668
- "language_model.model.layers.26.mlp.down_proj.weight": "model-00007-of-00011.safetensors",
669
- "language_model.model.layers.26.mlp.gate_proj.weight": "model-00007-of-00011.safetensors",
670
- "language_model.model.layers.26.mlp.up_proj.weight": "model-00007-of-00011.safetensors",
671
- "language_model.model.layers.26.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
672
- "language_model.model.layers.26.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
673
- "language_model.model.layers.26.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
674
- "language_model.model.layers.26.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
675
- "language_model.model.layers.26.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
676
- "language_model.model.layers.27.input_layernorm.weight": "model-00007-of-00011.safetensors",
677
- "language_model.model.layers.27.mlp.down_proj.weight": "model-00007-of-00011.safetensors",
678
- "language_model.model.layers.27.mlp.gate_proj.weight": "model-00007-of-00011.safetensors",
679
- "language_model.model.layers.27.mlp.up_proj.weight": "model-00007-of-00011.safetensors",
680
- "language_model.model.layers.27.post_attention_layernorm.weight": "model-00007-of-00011.safetensors",
681
- "language_model.model.layers.27.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
682
- "language_model.model.layers.27.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
683
- "language_model.model.layers.27.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
684
- "language_model.model.layers.27.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
685
- "language_model.model.layers.28.input_layernorm.weight": "model-00008-of-00011.safetensors",
686
- "language_model.model.layers.28.mlp.down_proj.weight": "model-00008-of-00011.safetensors",
687
- "language_model.model.layers.28.mlp.gate_proj.weight": "model-00008-of-00011.safetensors",
688
- "language_model.model.layers.28.mlp.up_proj.weight": "model-00008-of-00011.safetensors",
689
- "language_model.model.layers.28.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
690
- "language_model.model.layers.28.self_attn.k_proj.weight": "model-00007-of-00011.safetensors",
691
- "language_model.model.layers.28.self_attn.o_proj.weight": "model-00007-of-00011.safetensors",
692
- "language_model.model.layers.28.self_attn.q_proj.weight": "model-00007-of-00011.safetensors",
693
- "language_model.model.layers.28.self_attn.v_proj.weight": "model-00007-of-00011.safetensors",
694
- "language_model.model.layers.29.input_layernorm.weight": "model-00008-of-00011.safetensors",
695
- "language_model.model.layers.29.mlp.down_proj.weight": "model-00008-of-00011.safetensors",
696
- "language_model.model.layers.29.mlp.gate_proj.weight": "model-00008-of-00011.safetensors",
697
- "language_model.model.layers.29.mlp.up_proj.weight": "model-00008-of-00011.safetensors",
698
- "language_model.model.layers.29.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
699
- "language_model.model.layers.29.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
700
- "language_model.model.layers.29.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
701
- "language_model.model.layers.29.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
702
- "language_model.model.layers.29.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
703
- "language_model.model.layers.3.input_layernorm.weight": "model-00002-of-00011.safetensors",
704
- "language_model.model.layers.3.mlp.down_proj.weight": "model-00002-of-00011.safetensors",
705
- "language_model.model.layers.3.mlp.gate_proj.weight": "model-00002-of-00011.safetensors",
706
- "language_model.model.layers.3.mlp.up_proj.weight": "model-00002-of-00011.safetensors",
707
- "language_model.model.layers.3.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
708
- "language_model.model.layers.3.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
709
- "language_model.model.layers.3.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
710
- "language_model.model.layers.3.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
711
- "language_model.model.layers.3.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
712
- "language_model.model.layers.30.input_layernorm.weight": "model-00008-of-00011.safetensors",
713
- "language_model.model.layers.30.mlp.down_proj.weight": "model-00008-of-00011.safetensors",
714
- "language_model.model.layers.30.mlp.gate_proj.weight": "model-00008-of-00011.safetensors",
715
- "language_model.model.layers.30.mlp.up_proj.weight": "model-00008-of-00011.safetensors",
716
- "language_model.model.layers.30.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
717
- "language_model.model.layers.30.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
718
- "language_model.model.layers.30.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
719
- "language_model.model.layers.30.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
720
- "language_model.model.layers.30.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
721
- "language_model.model.layers.31.input_layernorm.weight": "model-00008-of-00011.safetensors",
722
- "language_model.model.layers.31.mlp.down_proj.weight": "model-00008-of-00011.safetensors",
723
- "language_model.model.layers.31.mlp.gate_proj.weight": "model-00008-of-00011.safetensors",
724
- "language_model.model.layers.31.mlp.up_proj.weight": "model-00008-of-00011.safetensors",
725
- "language_model.model.layers.31.post_attention_layernorm.weight": "model-00008-of-00011.safetensors",
726
- "language_model.model.layers.31.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
727
- "language_model.model.layers.31.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
728
- "language_model.model.layers.31.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
729
- "language_model.model.layers.31.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
730
- "language_model.model.layers.32.input_layernorm.weight": "model-00009-of-00011.safetensors",
731
- "language_model.model.layers.32.mlp.down_proj.weight": "model-00009-of-00011.safetensors",
732
- "language_model.model.layers.32.mlp.gate_proj.weight": "model-00008-of-00011.safetensors",
733
- "language_model.model.layers.32.mlp.up_proj.weight": "model-00009-of-00011.safetensors",
734
- "language_model.model.layers.32.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
735
- "language_model.model.layers.32.self_attn.k_proj.weight": "model-00008-of-00011.safetensors",
736
- "language_model.model.layers.32.self_attn.o_proj.weight": "model-00008-of-00011.safetensors",
737
- "language_model.model.layers.32.self_attn.q_proj.weight": "model-00008-of-00011.safetensors",
738
- "language_model.model.layers.32.self_attn.v_proj.weight": "model-00008-of-00011.safetensors",
739
- "language_model.model.layers.33.input_layernorm.weight": "model-00009-of-00011.safetensors",
740
- "language_model.model.layers.33.mlp.down_proj.weight": "model-00009-of-00011.safetensors",
741
- "language_model.model.layers.33.mlp.gate_proj.weight": "model-00009-of-00011.safetensors",
742
- "language_model.model.layers.33.mlp.up_proj.weight": "model-00009-of-00011.safetensors",
743
- "language_model.model.layers.33.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
744
- "language_model.model.layers.33.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
745
- "language_model.model.layers.33.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
746
- "language_model.model.layers.33.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
747
- "language_model.model.layers.33.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
748
- "language_model.model.layers.34.input_layernorm.weight": "model-00009-of-00011.safetensors",
749
- "language_model.model.layers.34.mlp.down_proj.weight": "model-00009-of-00011.safetensors",
750
- "language_model.model.layers.34.mlp.gate_proj.weight": "model-00009-of-00011.safetensors",
751
- "language_model.model.layers.34.mlp.up_proj.weight": "model-00009-of-00011.safetensors",
752
- "language_model.model.layers.34.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
753
- "language_model.model.layers.34.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
754
- "language_model.model.layers.34.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
755
- "language_model.model.layers.34.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
756
- "language_model.model.layers.34.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
757
- "language_model.model.layers.35.input_layernorm.weight": "model-00009-of-00011.safetensors",
758
- "language_model.model.layers.35.mlp.down_proj.weight": "model-00009-of-00011.safetensors",
759
- "language_model.model.layers.35.mlp.gate_proj.weight": "model-00009-of-00011.safetensors",
760
- "language_model.model.layers.35.mlp.up_proj.weight": "model-00009-of-00011.safetensors",
761
- "language_model.model.layers.35.post_attention_layernorm.weight": "model-00009-of-00011.safetensors",
762
- "language_model.model.layers.35.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
763
- "language_model.model.layers.35.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
764
- "language_model.model.layers.35.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
765
- "language_model.model.layers.35.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
766
- "language_model.model.layers.36.input_layernorm.weight": "model-00010-of-00011.safetensors",
767
- "language_model.model.layers.36.mlp.down_proj.weight": "model-00010-of-00011.safetensors",
768
- "language_model.model.layers.36.mlp.gate_proj.weight": "model-00009-of-00011.safetensors",
769
- "language_model.model.layers.36.mlp.up_proj.weight": "model-00009-of-00011.safetensors",
770
- "language_model.model.layers.36.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
771
- "language_model.model.layers.36.self_attn.k_proj.weight": "model-00009-of-00011.safetensors",
772
- "language_model.model.layers.36.self_attn.o_proj.weight": "model-00009-of-00011.safetensors",
773
- "language_model.model.layers.36.self_attn.q_proj.weight": "model-00009-of-00011.safetensors",
774
- "language_model.model.layers.36.self_attn.v_proj.weight": "model-00009-of-00011.safetensors",
775
- "language_model.model.layers.37.input_layernorm.weight": "model-00010-of-00011.safetensors",
776
- "language_model.model.layers.37.mlp.down_proj.weight": "model-00010-of-00011.safetensors",
777
- "language_model.model.layers.37.mlp.gate_proj.weight": "model-00010-of-00011.safetensors",
778
- "language_model.model.layers.37.mlp.up_proj.weight": "model-00010-of-00011.safetensors",
779
- "language_model.model.layers.37.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
780
- "language_model.model.layers.37.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
781
- "language_model.model.layers.37.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
782
- "language_model.model.layers.37.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
783
- "language_model.model.layers.37.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
784
- "language_model.model.layers.38.input_layernorm.weight": "model-00010-of-00011.safetensors",
785
- "language_model.model.layers.38.mlp.down_proj.weight": "model-00010-of-00011.safetensors",
786
- "language_model.model.layers.38.mlp.gate_proj.weight": "model-00010-of-00011.safetensors",
787
- "language_model.model.layers.38.mlp.up_proj.weight": "model-00010-of-00011.safetensors",
788
- "language_model.model.layers.38.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
789
- "language_model.model.layers.38.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
790
- "language_model.model.layers.38.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
791
- "language_model.model.layers.38.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
792
- "language_model.model.layers.38.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
793
- "language_model.model.layers.39.input_layernorm.weight": "model-00010-of-00011.safetensors",
794
- "language_model.model.layers.39.mlp.down_proj.weight": "model-00010-of-00011.safetensors",
795
- "language_model.model.layers.39.mlp.gate_proj.weight": "model-00010-of-00011.safetensors",
796
- "language_model.model.layers.39.mlp.up_proj.weight": "model-00010-of-00011.safetensors",
797
- "language_model.model.layers.39.post_attention_layernorm.weight": "model-00010-of-00011.safetensors",
798
- "language_model.model.layers.39.self_attn.k_proj.weight": "model-00010-of-00011.safetensors",
799
- "language_model.model.layers.39.self_attn.o_proj.weight": "model-00010-of-00011.safetensors",
800
- "language_model.model.layers.39.self_attn.q_proj.weight": "model-00010-of-00011.safetensors",
801
- "language_model.model.layers.39.self_attn.v_proj.weight": "model-00010-of-00011.safetensors",
802
- "language_model.model.layers.4.input_layernorm.weight": "model-00002-of-00011.safetensors",
803
- "language_model.model.layers.4.mlp.down_proj.weight": "model-00002-of-00011.safetensors",
804
- "language_model.model.layers.4.mlp.gate_proj.weight": "model-00002-of-00011.safetensors",
805
- "language_model.model.layers.4.mlp.up_proj.weight": "model-00002-of-00011.safetensors",
806
- "language_model.model.layers.4.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
807
- "language_model.model.layers.4.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
808
- "language_model.model.layers.4.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
809
- "language_model.model.layers.4.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
810
- "language_model.model.layers.4.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
811
- "language_model.model.layers.5.input_layernorm.weight": "model-00002-of-00011.safetensors",
812
- "language_model.model.layers.5.mlp.down_proj.weight": "model-00002-of-00011.safetensors",
813
- "language_model.model.layers.5.mlp.gate_proj.weight": "model-00002-of-00011.safetensors",
814
- "language_model.model.layers.5.mlp.up_proj.weight": "model-00002-of-00011.safetensors",
815
- "language_model.model.layers.5.post_attention_layernorm.weight": "model-00002-of-00011.safetensors",
816
- "language_model.model.layers.5.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
817
- "language_model.model.layers.5.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
818
- "language_model.model.layers.5.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
819
- "language_model.model.layers.5.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
820
- "language_model.model.layers.6.input_layernorm.weight": "model-00003-of-00011.safetensors",
821
- "language_model.model.layers.6.mlp.down_proj.weight": "model-00003-of-00011.safetensors",
822
- "language_model.model.layers.6.mlp.gate_proj.weight": "model-00002-of-00011.safetensors",
823
- "language_model.model.layers.6.mlp.up_proj.weight": "model-00003-of-00011.safetensors",
824
- "language_model.model.layers.6.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
825
- "language_model.model.layers.6.self_attn.k_proj.weight": "model-00002-of-00011.safetensors",
826
- "language_model.model.layers.6.self_attn.o_proj.weight": "model-00002-of-00011.safetensors",
827
- "language_model.model.layers.6.self_attn.q_proj.weight": "model-00002-of-00011.safetensors",
828
- "language_model.model.layers.6.self_attn.v_proj.weight": "model-00002-of-00011.safetensors",
829
- "language_model.model.layers.7.input_layernorm.weight": "model-00003-of-00011.safetensors",
830
- "language_model.model.layers.7.mlp.down_proj.weight": "model-00003-of-00011.safetensors",
831
- "language_model.model.layers.7.mlp.gate_proj.weight": "model-00003-of-00011.safetensors",
832
- "language_model.model.layers.7.mlp.up_proj.weight": "model-00003-of-00011.safetensors",
833
- "language_model.model.layers.7.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
834
- "language_model.model.layers.7.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
835
- "language_model.model.layers.7.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
836
- "language_model.model.layers.7.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
837
- "language_model.model.layers.7.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
838
- "language_model.model.layers.8.input_layernorm.weight": "model-00003-of-00011.safetensors",
839
- "language_model.model.layers.8.mlp.down_proj.weight": "model-00003-of-00011.safetensors",
840
- "language_model.model.layers.8.mlp.gate_proj.weight": "model-00003-of-00011.safetensors",
841
- "language_model.model.layers.8.mlp.up_proj.weight": "model-00003-of-00011.safetensors",
842
- "language_model.model.layers.8.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
843
- "language_model.model.layers.8.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
844
- "language_model.model.layers.8.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
845
- "language_model.model.layers.8.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
846
- "language_model.model.layers.8.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
847
- "language_model.model.layers.9.input_layernorm.weight": "model-00003-of-00011.safetensors",
848
- "language_model.model.layers.9.mlp.down_proj.weight": "model-00003-of-00011.safetensors",
849
- "language_model.model.layers.9.mlp.gate_proj.weight": "model-00003-of-00011.safetensors",
850
- "language_model.model.layers.9.mlp.up_proj.weight": "model-00003-of-00011.safetensors",
851
- "language_model.model.layers.9.post_attention_layernorm.weight": "model-00003-of-00011.safetensors",
852
- "language_model.model.layers.9.self_attn.k_proj.weight": "model-00003-of-00011.safetensors",
853
- "language_model.model.layers.9.self_attn.o_proj.weight": "model-00003-of-00011.safetensors",
854
- "language_model.model.layers.9.self_attn.q_proj.weight": "model-00003-of-00011.safetensors",
855
- "language_model.model.layers.9.self_attn.v_proj.weight": "model-00003-of-00011.safetensors",
856
- "language_model.model.norm.weight": "model-00010-of-00011.safetensors",
857
- "multi_modal_projector.linear_1.weight": "model-00011-of-00011.safetensors",
858
- "multi_modal_projector.linear_2.weight": "model-00011-of-00011.safetensors"
859
- }
860
- }
 
params.json CHANGED
@@ -8,7 +8,6 @@
  "rope_theta": 100000000.0,
  "norm_eps": 1e-05,
  "vocab_size": 131072,
- "max_position_embeddings": 32768,
  "multimodal": {
  "whisper_model_args": {
  "encoder_args": {
@@ -17,17 +16,35 @@
  "head_dim": 64,
  "hidden_dim": 5120,
  "n_heads": 20,
+ "n_kv_heads": 20,
+ "rope_theta": 1000000.0,
+ "norm_eps": 1e-05,
  "vocab_size": 51866,
+ "tied_embeddings": true,
+ "ffn_type": "gelu",
+ "conv_strides_str": "1,2",
  "max_source_positions": 1500,
+ "pos_embed": "learned",
+ "freeze_encoder": true,
  "audio_encoding_args": {
  "sampling_rate": 16000,
+ "frame_rate": 12.5,
  "num_mel_bins": 128,
  "hop_length": 160,
- "window_size": 400
+ "window_size": 400,
+ "chunk_length_s": 30.0
  }
  },
+ "decoder_args": null,
+ "audio_token_id": 24,
+ "begin_audio_token_id": 25,
+ "audio_special_tokens_in_adapter": false,
+ "audio_special_token_ids": null,
+ "output_embedding_concat_type": "mlp",
  "downsample_args": {
- "downsample_factor": 4
+ "downsample_factor": 4,
+ "num_blocks": 1,
+ "n_transformer_layers_per_block": 0
  }
  }
  }
preprocessor_config.json DELETED
@@ -1,15 +0,0 @@
- {
- "chunk_length": 30,
- "dither": 0.0,
- "feature_extractor_type": "WhisperFeatureExtractor",
- "feature_size": 128,
- "hop_length": 160,
- "n_fft": 400,
- "n_samples": 480000,
- "nb_max_frames": 3000,
- "padding_side": "right",
- "padding_value": 0.0,
- "processor_class": "VoxtralProcessor",
- "return_attention_mask": false,
- "sampling_rate": 16000
- }
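
For readers cross-checking the configuration changes above: the audio front-end values in `params.json` and the deleted `preprocessor_config.json` are arithmetically consistent. The sketch below is a minimal illustration under the assumption of a Whisper-style mel front-end (second conv stride 2 per `"conv_strides_str": "1,2"`, followed by the 4x adapter downsample); the variable names are illustrative, not identifiers from this repository.

```python
# Minimal consistency check of the audio front-end values shown in the diffs above.
# Assumption (not documented here): a Whisper-style encoder whose second conv has
# stride 2 ("conv_strides_str": "1,2"), followed by the 4x adapter downsample.

sampling_rate = 16_000        # Hz, from "sampling_rate"
hop_length = 160              # samples per mel frame, from "hop_length"
chunk_length_s = 30.0         # seconds per chunk, from "chunk_length_s" / "chunk_length"
conv_stride = 2               # assumed from "conv_strides_str": "1,2"
downsample_factor = 4         # from "downsample_args"

samples_per_chunk = int(chunk_length_s * sampling_rate)              # 480000 -> "n_samples"
mel_frames_per_second = sampling_rate / hop_length                   # 100.0
mel_frames_per_chunk = int(chunk_length_s * mel_frames_per_second)   # 3000   -> "nb_max_frames"
encoder_positions = mel_frames_per_chunk // conv_stride              # 1500   -> "max_source_positions"
embeddings_per_second = mel_frames_per_second / (conv_stride * downsample_factor)  # 12.5 -> "frame_rate"
audio_tokens_per_chunk = int(chunk_length_s * embeddings_per_second) # 375

print(samples_per_chunk, mel_frames_per_chunk, encoder_positions,
      embeddings_per_second, audio_tokens_per_chunk)
```

Under these assumptions a 30-second chunk yields 3000 mel frames, 1500 encoder positions, and 375 audio embeddings at 12.5 per second, matching `nb_max_frames`, `max_source_positions`, and `frame_rate` in the files above.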