Commit 5fbb969 (verified) · 0 parent(s)

Super-squash branch 'main' using huggingface_hub


Co-authored-by: sanchit-gandhi <sanchit-gandhi@users.noreply.huggingface.co>
Co-authored-by: pandora-s <pandora-s@users.noreply.huggingface.co>

Files changed (5):
  1. .gitattributes +37 -0
  2. README.md +242 -0
  3. consolidated.safetensors +3 -0
  4. params.json +34 -0
  5. tekken.json +3 -0
.gitattributes ADDED
@@ -0,0 +1,37 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tekken[[:space:]](19).json filter=lfs diff=lfs merge=lfs -text
+ tekken.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,242 @@
+ ---
+ language:
+ - en
+ - fr
+ - de
+ - es
+ - it
+ - pt
+ - nl
+ - hi
+ license: apache-2.0
+ library_name: vllm
+ inference: false
+ base_model:
+ - mistralai/Mistral-Small-24B-Base-2501
+ extra_gated_description: >-
+   If you want to learn more about how we process your personal data, please read
+   our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
+ pipeline_tag: audio-text-to-text
+ ---
+
+ # Voxtral Small 1.0 (24B) - 2507
+
+ Voxtral Small is an enhancement of [Mistral Small 3](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
+
+ Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral-2507).
+
+ ## Key Features
+
+ Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities:
+ - **Dedicated transcription mode**: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
+ - **Long-form context**: With a 32k token context length, Voxtral handles audio up to 30 minutes for transcription, or 40 minutes for understanding
+ - **Built-in Q&A and summarization**: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
+ - **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
+ - **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
+ - **Highly capable at text**: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1
+
+ ## Benchmark Results
+
+ ### Audio
+
+ Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)
+
+
+ ### Text
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/9i9jpA2OLwhYut_wet836.png)
+
+ ## Usage
+
+ The model can be used with the following frameworks:
+ - [`vllm (recommended)`](https://github.com/vllm-project/vllm): See [here](#vllm-recommended)
+
+ **Notes**:
+
+ - Use `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
+ - Multiple audios per message and multiple user turns with audio are supported
+ - Function calling is supported
+ - System prompts are not yet supported
+
+
+ ### vLLM (recommended)
+
+ We recommend using this model with [vLLM](https://github.com/vllm-project/vllm).
+
+ #### Installation
+
+ Make sure to install vLLM from the `main` branch (nightly wheels):
+
+ ```sh
+ pip install -U vllm\[audio\] \
+   --pre \
+   --extra-index-url https://wheels.vllm.ai/nightly
+ ```
+
+ Doing so should automatically install [`mistral_common >= 1.8.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.8.0).
+
+ To check:
+ ```sh
+ python -c "import mistral_common; print(mistral_common.__version__)"
+ ```
+
+ #### Offline
+
+ You can test that your vLLM setup works as expected by cloning the vLLM repo:
+
+ ```sh
+ git clone https://github.com/vllm-project/vllm && cd vllm
+ ```
+
+ and then running:
+
+ ```sh
+ python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
+ ```
+
+ #### Serve
+
+ We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.
+
+ 1. Spin up a server:
+
+ ```sh
+ vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 2
+ ```
+
+ **Note:** Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
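+
+ Once the server is up (it listens on port 8000 by default), you can, for example, verify that it is reachable by listing the served models through the OpenAI-compatible API; the host below is a placeholder:
+
+ ```sh
+ curl http://<your-server-host>:8000/v1/models
+ ```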
+
+
+ 2. To query the server you can use a simple Python snippet. See the following examples.
+
+
+ #### Audio Instruct
+
+ Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.
+
+ Make sure that your client has `mistral-common` with audio installed:
+
+ ```sh
+ pip install --upgrade mistral_common\[audio\]
+ ```
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```py
+ from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ models = client.models.list()
+ model = models.data[0].id
+
+ obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
+ bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
+
+ def file_to_chunk(file: str) -> AudioChunk:
+     audio = Audio.from_file(file, strict=False)
+     return AudioChunk.from_audio(audio)
+
+ text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
+ user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
+
+ print(30 * "=" + "USER 1" + 30 * "=")
+ print(text_chunk.text)
+ print("\n\n")
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=[user_msg],
+     temperature=0.2,
+     top_p=0.95,
+ )
+ content = response.choices[0].message.content
+
+ print(30 * "=" + "BOT 1" + 30 * "=")
+ print(content)
+ print("\n\n")
+ # The model could give the following answer:
+ # ```L'orateur le plus inspirant est le président.
+ # Il est plus inspirant parce qu'il parle de ses expériences personnelles
+ # et de son optimisme pour l'avenir du pays.
+ # Il est différent de l'autre orateur car il ne parle pas de la météo,
+ # mais plutôt de ses interactions avec les gens et de son rôle en tant que président.```
+
+ messages = [
+     user_msg,
+     AssistantMessage(content=content).to_openai(),
+     UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
+ ]
+ print(30 * "=" + "USER 2" + 30 * "=")
+ print(messages[-1]["content"])
+ print("\n\n")
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=messages,
+     temperature=0.2,
+     top_p=0.95,
+ )
+ content = response.choices[0].message.content
+ print(30 * "=" + "BOT 2" + 30 * "=")
+ print(content)
+ ```
+ </details>
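+
+ #### Function Calling
+
+ As noted above, function calling is supported: spoken intents can trigger tool calls. Below is a minimal sketch against the same server; the `get_weather` tool is a hypothetical placeholder, and it assumes the server honors the standard OpenAI `tools` parameter.
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```py
+ from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ client = OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1")
+ model = client.models.list().data[0].id
+
+ # A spoken request, e.g. asking about the weather.
+ audio_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
+ audio_chunk = AudioChunk.from_audio(Audio.from_file(audio_file, strict=False))
+ user_msg = UserMessage(content=[audio_chunk]).to_openai()
+
+ # Hypothetical tool definition in the standard OpenAI format.
+ tools = [{
+     "type": "function",
+     "function": {
+         "name": "get_weather",
+         "description": "Get the current weather in a given city.",
+         "parameters": {
+             "type": "object",
+             "properties": {"city": {"type": "string", "description": "Name of the city."}},
+             "required": ["city"],
+         },
+     },
+ }]
+
+ response = client.chat.completions.create(
+     model=model,
+     messages=[user_msg],
+     tools=tools,
+     temperature=0.2,
+     top_p=0.95,
+ )
+ # If the model decides to use the tool, the call appears here.
+ print(response.choices[0].message.tool_calls)
+ ```
+ </details>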
+
+ #### Transcription
+
+ Voxtral-Small-24B-2507 has powerful transcription capabilities!
+
+ Make sure that your client has `mistral-common` with audio installed:
+
+ ```sh
+ pip install --upgrade mistral_common\[audio\]
+ ```
+
+ <details>
+ <summary>Python snippet</summary>
+
+ ```python
+ from mistral_common.protocol.transcription.request import TranscriptionRequest
+ from mistral_common.protocol.instruct.messages import RawAudio
+ from mistral_common.audio import Audio
+ from huggingface_hub import hf_hub_download
+
+ from openai import OpenAI
+
+ # Modify OpenAI's API key and API base to use vLLM's API server.
+ openai_api_key = "EMPTY"
+ openai_api_base = "http://<your-server-host>:8000/v1"
+
+ client = OpenAI(
+     api_key=openai_api_key,
+     base_url=openai_api_base,
+ )
+
+ models = client.models.list()
+ model = models.data[0].id
+
+ obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
+ audio = Audio.from_file(obama_file, strict=False)
+
+ audio = RawAudio.from_audio(audio)
+ req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
+
+ response = client.audio.transcriptions.create(**req)
+ print(response)
+ ```
+ </details>
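+
+ Since Voxtral predicts the source audio language automatically (see Key Features), you can presumably omit the `language` argument and let the model detect it; this assumes `language` is an optional field of `TranscriptionRequest`:
+
+ ```python
+ req = TranscriptionRequest(model=model, audio=audio, temperature=0.0).to_openai(exclude=("top_p", "seed"))
+ ```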
consolidated.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:421cd5988a614ac9ecf63eaf0db3572a36a6318dbb3c57cba24d5ef4d4abea7e
+ size 48519877672
params.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "dim": 5120,
+   "n_layers": 40,
+   "head_dim": 128,
+   "hidden_dim": 32768,
+   "n_heads": 32,
+   "n_kv_heads": 8,
+   "rope_theta": 100000000.0,
+   "norm_eps": 1e-05,
+   "vocab_size": 131072,
+   "max_position_embeddings": 32768,
+   "multimodal": {
+     "whisper_model_args": {
+       "encoder_args": {
+         "dim": 1280,
+         "n_layers": 32,
+         "head_dim": 64,
+         "hidden_dim": 5120,
+         "n_heads": 20,
+         "vocab_size": 51866,
+         "max_source_positions": 1500,
+         "audio_encoding_args": {
+           "sampling_rate": 16000,
+           "num_mel_bins": 128,
+           "hop_length": 160,
+           "window_size": 400
+         }
+       },
+       "downsample_args": {
+         "downsample_factor": 4
+       }
+     }
+   }
+ }
tekken.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4aaf3836c2a5332f029ce85a7a62255c966f47b6797ef81dedd0ade9c862e4a8
+ size 14894206