---
library_name: transformers
tags:
  - generated_from_trainer
model-index:
  - name: Llama-speechlmm-1.0-xl
    results: []
---

Model information

SpeechLMM 1.0 is a collection of multimodal and multilingual instruction-tuned generative large language models in 4 different sizes: S (2B), M (4B), L (9B) and XL (71B), supporting text, audio and video as input and text only as output. The SpeechLMM 1.0 models are optimized for various X-to-text generation tasks, namely:

  • Machine Translation
  • Automatic Speech Recognition
  • Speech Translation
  • Speech Summarization
  • Spoken Question Answering
  • Spoken Language Understanding (beta)
  • Visual Speech Recognition (beta)

Model Developer: Meetween consortium

Supported Languages: English, French, Italian, German, and Spanish are officially supported (for a subset of the supported tasks). The Llama 3.X backbone and the SeamlessM4T v2 audio encoder have been trained on a broader collection of languages than these 5 supported languages, so the model might exhibit good performance on other languages too.

Model Release Date: Feb 28, 2025

License: see LICENSE

Model Architecture

SpeechLMM 1.0 is an auto-regressive multimodal language model based on a Llama 3.X backbone (X varies with the model size), a speech-specific stack consisting of a pre-trained audio encoder (SeamlessM4T v2) and an audio adapter, and a video-specific stack consisting of a pre-trained video encoder (Auto-AVSR) and a video adapter.

| Model | Params | Input modalities | Output modalities | Context Length |
|---|---|---|---|---|
| SpeechLMM 1.0 S | 2B (2.17B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 M | 4B (4.15B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 L | 9B (8.98B) | Multilingual text and audio, English video | Multilingual Text | 128k |
| SpeechLMM 1.0 XL (beta) | 71B (71.5B) | Multilingual text and audio, English video | Multilingual Text | 128k |
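
As a rough illustration of the architecture described above, the sketch below wires a tiny stand-in backbone together with an MLP audio adapter and a placeholder video adapter: the adapters project the encoder outputs into the backbone's embedding space, and the projected features are prepended to the text token embeddings before causal decoding. All module choices, names and dimensions are illustrative assumptions, not the actual speechlmm implementation.

    import torch
    import torch.nn as nn

    class ToySpeechLMM(nn.Module):
        """Schematic X-to-text model: encoder features are projected by modality
        adapters into the LLM embedding space and prepended to the text tokens."""

        def __init__(self, d_audio=1024, d_video=512, d_model=256, vocab_size=1000):
            super().__init__()
            self.text_embed = nn.Embedding(vocab_size, d_model)
            # MLP adapter for audio features (stand-in for the real audio adapter)
            self.audio_adapter = nn.Sequential(
                nn.Linear(d_audio, d_model), nn.GELU(), nn.Linear(d_model, d_model))
            # Placeholder video adapter (the real one is a window-level Q-former)
            self.video_adapter = nn.Linear(d_video, d_model)
            # Tiny causal Transformer standing in for the Llama backbone
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=2)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, text_ids, audio_feats, video_feats):
            seq = torch.cat([
                self.audio_adapter(audio_feats),   # (B, T_audio, d_model)
                self.video_adapter(video_feats),   # (B, T_video, d_model)
                self.text_embed(text_ids),         # (B, T_text,  d_model)
            ], dim=1)
            # Causal mask: every position attends only to earlier positions
            causal = torch.triu(
                torch.full((seq.size(1), seq.size(1)), float("-inf")), diagonal=1)
            return self.lm_head(self.backbone(seq, mask=causal))

    # Toy inputs: 20 audio frames, 10 video frames and 5 text tokens
    model = ToySpeechLMM()
    logits = model(torch.randint(0, 1000, (1, 5)),
                   torch.randn(1, 20, 1024), torch.randn(1, 10, 512))
    print(logits.shape)  # torch.Size([1, 35, 1000])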

Audio and video encoders

For all 4 sizes of SpeechLMM 1.0, the audio encoder is SeamlessM4T v2 Large (facebook/seamless-m4t-v2-large) and the video encoder is Auto-AVSR (vsr_trlrs3vox2_base).

Audio and video adapters

For all 4 sizes of SpeechLMM 1.0, the audio and video adapters are:

| Modality | Architecture | Number of layers | Compression factor |
|---|---|---|---|
| Audio | MLP | 4 | 1 |
| Video | Window-level Q-former (4 queries) | 4 | 4 |
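
To illustrate what a window-level Q-former with 4 queries and a compression factor of 4 does, the sketch below splits the incoming frame features into fixed-size windows and lets 4 learned queries cross-attend to each window, so every window of frames is reduced to 4 output tokens. It assumes the compression factor is the ratio of input frames to output tokens; the window size (16 frames) and all dimensions are illustrative choices, not the values used by speechlmm.

    import torch
    import torch.nn as nn

    class WindowQFormerAdapter(nn.Module):
        """Toy window-level Q-former adapter: each window of `window` input frames
        is compressed into `n_queries` output tokens via cross-attention."""

        def __init__(self, d_in=512, d_model=256, window=16, n_queries=4):
            super().__init__()
            self.window = window
            self.proj = nn.Linear(d_in, d_model)
            # Learned queries, shared across all windows
            self.queries = nn.Parameter(0.02 * torch.randn(n_queries, d_model))
            self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

        def forward(self, x):  # x: (B, T, d_in), with T divisible by the window size
            b, t, _ = x.shape
            # Group the frames into windows: (B * T // window, window, d_model)
            windows = self.proj(x).reshape(b * t // self.window, self.window, -1)
            queries = self.queries.expand(windows.size(0), -1, -1)
            out, _ = self.cross_attn(queries, windows, windows)
            # Back to (B, T // window * n_queries, d_model)
            return out.reshape(b, -1, out.size(-1))

    feats = torch.randn(2, 64, 512)              # 64 video frames per clip
    print(WindowQFormerAdapter()(feats).shape)   # torch.Size([2, 16, 256]): 4x fewer tokens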

LLM backbone

| Model | Backbone |
|---|---|
| SpeechLMM 1.0 S | Llama 3.2 1B Instruct |
| SpeechLMM 1.0 M | Llama 3.2 3B Instruct |
| SpeechLMM 1.0 L | Llama 3.1 8B Instruct |
| SpeechLMM 1.0 XL (beta) | Llama 3.3 70B Instruct |

How to use

Currently, this model can only be used via our speechlmm codebase. Refer to the instructions there for more details.

Important: before you can use this model, you must follow these steps:

  1. Download the SeamlessM4T v2 speech encoder weights:
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    # Load the full SeamlessM4T v2 model and its processor from the Hub
    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

    # Save the processor and the speech encoder only
    processor.save_pretrained("path/to/some_directory_1")
    model.speech_encoder.save_pretrained("path/to/some_directory_1")
    
  2. Go to config.json and change the audio_encoder._name_or_path to path/to/some_directory_1 (see the config.json sketch after this list)
  3. Download the Auto-AVSR video encoder weights from here and put them in path/to/some_directory_2
  4. Go to config.json and change the video_encoder._name_or_path to path/to/some_directory_2/vsr_trlrs3vox2_base.pth
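
For steps 2 and 4, here is a minimal sketch of how the two config.json entries could be updated programmatically. The checkpoint path and the assumption that audio_encoder and video_encoder are top-level objects inside config.json are illustrative; adapt them to your local copy of the model.

    import json

    # Hypothetical path to your local copy of this model's config.json
    config_path = "path/to/Llama-speechlmm-1.0-xl/config.json"

    with open(config_path) as f:
        config = json.load(f)

    # Step 2: point the audio encoder at the locally saved SeamlessM4T v2 speech encoder
    config["audio_encoder"]["_name_or_path"] = "path/to/some_directory_1"
    # Step 4: point the video encoder at the downloaded Auto-AVSR checkpoint
    config["video_encoder"]["_name_or_path"] = "path/to/some_directory_2/vsr_trlrs3vox2_base.pth"

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)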

Training Data

Monolingual

| TASK | Task name | Dataset | License |
|---|---|---|---|
| ASR | Automatic Speech Recognition | LibriHeavy | CC-BY-4.0 |
| | | LibriTTS | CC-BY-4.0 |
| | | AMI | CC-BY-4.0 |
| | | ICSI | CC-BY-4.0 |
| VSR | Visual Speech Recognition | LRS2-BBC | Custom |
| SSUM | Speech Summarization | AMI | CC-BY-4.0 |
| | | ICSI | CC-BY-4.0 |
| SQA | Spoken Question Answering | Spoken SQuAD | CC-BY-SA-4.0 |
| SLU | Spoken Language Understanding | SLURP | CC-BY-4.0 (text), CC-BY-NC-4.0 (audio) |

Multilingual

| TASK | Task name | Dataset | License |
|---|---|---|---|
| ASR | Automatic Speech Recognition | CoVoST2 | CC0 |
| | | CommonVoice | Apache-2.0 |
| ST | Speech-to-text Translation | CoVoST2 | CC0 |
| | | EuroParl-ST | CC-BY-NC-4.0 |
| MT | Machine Translation | EuroParl-ST | CC-BY-NC-4.0 |
| TextInstruct | Text Instruction Following | Everything_Instruct_Multilingual | Apache-2.0 |
| SLU | Spoken Language Understanding | Speech-Massive | CC-BY-NC-SA-4.0 |

Evaluation Results

Results for the XL model are coming soon...

Framework versions

  • Transformers 4.45.0
  • Pytorch 2.3.1+cu124.post2
  • Datasets 3.2.0
  • Tokenizers 0.20.0