library_name: transformers
tags:
- generated_from_trainer
model-index:
- name: Llama-speechlmm-1.0-xl
results: []
Model information
SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative large language models in 4 sizes: S (2B), M (4B), L (9B) and XL (71B). The models accept text, audio and video as input and produce only text as output. They are optimized for various X-to-text generation tasks, namely:
- Machine Translation
- Automatic Speech Recognition
- Speech Translation
- Speech Summarization
- Spoken Question Answering
- Spoken Language Understanding (beta)
- Visual Speech Recognition (beta)
Model Developer: Meetween consortium
Supported Languages: English, French, Italian, German, and Spanish are officially supported (for a subset of the supported tasks). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder have been trained on a broader collection of languages than these 5 supported languages, so the model might exhibit good performance on other languages too.
Model Release Date: Feb 28, 2025
License: see LICENSE
Model Architecture
SpeechLMM 1.0 is an auto-regressive multimodal language model with three components: a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder (SeamlessM4T v2) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder (Auto-AVSR) and a video adapter.
Model | Params | Input modalities | Output modalities | Context Length |
---|---|---|---|---|
SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual Text | 128k |
SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual Text | 128k |
SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual Text | 128k |
SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |
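As a rough guide to what the parameter counts in the table imply, the sketch below estimates the memory needed just to hold the weights. It assumes 2 bytes per parameter (bf16 or fp16), which is an assumption and not something this card states; activations and the KV cache are not counted. `weight_memory_gb` is a hypothetical helper name used only for illustration.

```python
# Exact parameter counts (in billions) copied from the table above.
PARAMS_BILLIONS = {"S": 2.17, "M": 4.15, "L": 8.98, "XL": 71.5}

# Assumption: weights stored in a 16-bit format (bf16/fp16), i.e. 2 bytes each.
BYTES_PER_PARAM = 2


def weight_memory_gb(params_billions: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Estimate weight-only memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


for size, params in PARAMS_BILLIONS.items():
    print(f"SpeechLMM 1.0 {size}: about {weight_memory_gb(params):.1f} GB of weights")
```

Passing `bytes_per_param=4` gives the corresponding fp32 estimate.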
Audio and video encoders
For all 4 sizes of SpeechLMM 1.0, the audio encoder is SeamlessM4T v2 Large (`facebook/seamless-m4t-v2-large`) and the video encoder is Auto-AVSR (`vsr_trlrs3vox2_base`).
Audio and video adapters
For all 4 sizes of SpeechLMM 1.0, the audio and video adapters are:
Modality | Architecture | Number of layers | Compression factor |
---|---|---|---|
Audio | MLP | 4 | 1 |
Video | Window-level Q-former (4 queries) | 4 | 4 |
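To make the compression factor concrete, here is a minimal sketch of how it could translate into sequence lengths. It assumes the compression factor is the ratio of encoder frames to adapter output vectors, and that a trailing partial window still produces output (hence the ceiling); neither detail is confirmed by this card. The function name and the 100-frame input are hypothetical.

```python
import math

# Values copied from the adapter table above.
AUDIO_COMPRESSION = 1
VIDEO_COMPRESSION = 4
VIDEO_QUERIES_PER_WINDOW = 4


def adapter_output_length(num_encoder_frames: int, compression_factor: int) -> int:
    """Adapter output length, assuming length = ceil(frames / compression_factor)."""
    if compression_factor < 1:
        raise ValueError("compression_factor must be >= 1")
    return math.ceil(num_encoder_frames / compression_factor)


# Under the same reading, one Q-former window that emits 4 queries at a
# compression factor of 4 would span 4 * 4 = 16 encoder frames.
video_window_frames = VIDEO_QUERIES_PER_WINDOW * VIDEO_COMPRESSION

print(adapter_output_length(100, AUDIO_COMPRESSION))  # audio: length unchanged
print(adapter_output_length(100, VIDEO_COMPRESSION))  # video: 4x shorter
```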
LLM backbone
Model | Backbone |
---|---|
SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |
How to use
Currently, this model can only be used via our `speechlmm` codebase. Refer to the instructions there for more details.
Important: before you can use this model, you must follow these steps:
- Download the SeamlessM4T v2 speech encoder weights:
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    # Load the processor and the full SeamlessM4T v2 model from the Hub.
    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

    # Save the processor and only the speech encoder to a local directory.
    processor.save_pretrained("path/to/some_directory_1")
    model.speech_encoder.save_pretrained("path/to/some_directory_1")
- Go to `config.json` and change `audio_encoder._name_or_path` to `path/to/some_directory_1`
- Download the Auto-AVSR video encoder weights from here and put them in `path/to/some_directory_2`
- Go to `config.json` and change `video_encoder._name_or_path` to `path/to/some_directory_2/vsr_trlrs3vox2_base.pth`
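The two `config.json` edits above can also be scripted. The following is a minimal sketch, assuming that `audio_encoder` and `video_encoder` are nested JSON objects that each hold a `_name_or_path` key, as the dotted names in the steps suggest. `set_encoder_paths` is a hypothetical helper and not part of the `speechlmm` codebase.

```python
import json
from pathlib import Path


def set_encoder_paths(config_path, audio_encoder_dir, video_encoder_ckpt):
    """Point the audio and video encoder entries of config.json at local files.

    Assumption: the config stores each encoder as a nested object with a
    "_name_or_path" key. All other keys are left untouched.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Create the nested objects if missing, then overwrite only the path key.
    config.setdefault("audio_encoder", {})["_name_or_path"] = str(audio_encoder_dir)
    config.setdefault("video_encoder", {})["_name_or_path"] = str(video_encoder_ckpt)

    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

With the placeholder paths used in the steps, the call would be `set_encoder_paths("config.json", "path/to/some_directory_1", "path/to/some_directory_2/vsr_trlrs3vox2_base.pth")`.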
Training Data
Monolingual
TASK | Task name | Dataset | License |
---|---|---|---|
ASR | Automatic Speech Recognition | LibriHeavy | CC-BY-4.0 |
ASR | Automatic Speech Recognition | LibriTTS | CC-BY-4.0 |
ASR | Automatic Speech Recognition | AMI | CC-BY-4.0 |
ASR | Automatic Speech Recognition | ICSI | CC-BY-4.0 |
VSR | Visual Speech Recognition | LRS2-BBC | Custom |
SSUM | Speech Summarization | AMI | CC-BY-4.0 |
SSUM | Speech Summarization | ICSI | CC-BY-4.0 |
SQA | Spoken Question Answering | Spoken SQUAD | CC-BY-SA-4.0 |
SLU | Spoken Language Understanding | SLURP | CC-BY-4.0 (text), CC-BY-NC-4.0 (audio) |
Multilingual
TASK | Task name | Dataset | License |
---|---|---|---|
ASR | Automatic Speech Recognition | CoVoST2 | CC0 |
ASR | Automatic Speech Recognition | CommonVoice | Apache-2.0 |
ST | Speech-to-text Translation | CoVoST2 | CC0 |
ST | Speech-to-text Translation | EuroParl-ST | CC-BY-NC-4.0 |
MT | Machine Translation | EuroParl-ST | CC-BY-NC-4.0 |
TextInstruct | Text Instruction Following | Everything_Instruct_Multilingual | Apache-2.0 |
SLU | Spoken Language Understanding | Speech-Massive | CC-BY-NC-SA-4.0 |
Evaluation Results
Results for the XL model are coming soon...
Framework versions
- Transformers 4.45.0
- PyTorch 2.3.1+cu124.post2
- Datasets 3.2.0
- Tokenizers 0.20.0
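To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: the distribution names (`transformers`, `torch`, `datasets`, `tokenizers`) are the usual PyPI names and are assumed here, local build suffixes such as `+cu124.post2` are ignored in the comparison, and `check_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions from the list above; the PyTorch entry is reduced to its base version.
EXPECTED = {
    "transformers": "4.45.0",
    "torch": "2.3.1",
    "datasets": "3.2.0",
    "tokenizers": "0.20.0",
}


def check_versions(expected=EXPECTED, get_version=metadata.version):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop any local build suffix ("+cu124...") before comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[package] = (installed, matches)
    return report


for package, (installed, matches) in check_versions().items():
    status = "ok" if matches else "differs or missing"
    print(f"{package}: installed={installed}, expected={EXPECTED[package]} ({status})")
```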