---
language:
- en
- fr
- es
- de
license: mit
library_name: transformers
tags:
- audio
- automatic-speech-recognition
- transformers.js
widget:
- example_title: LibriSpeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
---

# Whisper-Large-V3-Distil-Multi4-v0.2

A multilingual distilled Whisper model with 2 decoder layers, supporting 4 European languages: English, French, Spanish, and German.

The model was trained during my work on [Distil-Large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5).

A notable feature is native support for **code-switching**: the model can switch languages within a single transcribed segment by emitting a new language token whenever it detects a language change (as demonstrated in the example below).

*During training, the `<|yue|>` language token was repurposed as an automatic language detection token that enables code-switching at inference time. To use this feature, set the language parameter to `cantonese` (the default).*
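As a minimal sketch of what "setting the language parameter" might look like in practice, the hypothetical helper below builds keyword arguments for `model.generate`. The helper name `build_generate_kwargs` is an assumption made for illustration. The `language` and `task` arguments are those accepted by Whisper generation in recent `transformers` releases, and whether `"cantonese"` is recognized may depend on the installed version.

```python
def build_generate_kwargs(language="cantonese", max_new_tokens=128):
    """Hypothetical helper: keyword arguments for Whisper's `model.generate`.

    `language="cantonese"` selects the repurposed `<|yue|>` token, which this
    model card describes as the automatic language detection / code-switching
    mode. Passing another supported language (e.g. "french") is assumed to pin
    the transcription to that language instead.
    """
    return {
        "language": language,
        "task": "transcribe",
        "max_new_tokens": max_new_tokens,
    }


# Automatic detection with code-switching (the model card's default behaviour)
auto_kwargs = build_generate_kwargs()

# Assumption: pin the output language when the audio is known to be French
french_kwargs = build_generate_kwargs(language="french")

# With `model` and `input_features` prepared as in the Usage section:
# predicted_ids = model.generate(input_features, **auto_kwargs)
```

Because `cantonese` is described as the default, omitting `language` altogether should give the same behaviour; passing it explicitly only makes the intent visible in the calling code.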

The model underperforms both the monolingual distilled version and Whisper-Large-v3-Turbo. Future work should investigate better training procedures, and possibly more training data, to compress multilingual capabilities into a single model more effectively.

## Table of Contents

- [Usage](#usage)
- [Evaluation](#evaluation)

## Usage

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi4-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]

# Ground truth text
print(text)
# Aber sei ihnen nicht böse, Habibi, vergib ihnen, sie vergaßen die Liebe, sie vergaßen die Bibel, 
# wünsch ihnen den Frieden. Nous allons construire des radiotélescopes géants comme celui-ci, 
# qui est mon préféré. Questa è un'immagine di Cairo Open City, una mostra che il museo Folkwang di 
# Essen ha dedicato al ruolo della mobile photography nella primavera Araba.

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features


# Generate tokens
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
#  Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden. Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré. Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.

# Inspect the generated tokens, keeping special tokens such as language tags
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
# <|de|> Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden.<|fr|> Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré.<|es|> Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.
```
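The language tokens in the last output can be used to recover per-language segments. The snippet below is a minimal sketch, not part of the model's API: the helper `split_by_language` and its regular expressions are assumptions, and it treats any `<|xx|>` or `<|xxx|>` token made of lowercase letters as a language tag while discarding other special tokens.

```python
import re

# Language tags look like <|de|> or <|yue|>: two or three lowercase letters
LANG_TOKEN = re.compile(r"<\|([a-z]{2,3})\|>")
# Any other special token, e.g. <|startoftranscript|> or <|endoftext|>
OTHER_TOKEN = re.compile(r"<\|[^|>]+\|>")


def split_by_language(decoded):
    """Split a transcription decoded with special tokens into (language, text) pairs."""
    # re.split with one capture group yields [prefix, lang1, text1, lang2, text2, ...]
    parts = LANG_TOKEN.split(decoded)
    segments = []
    for lang, text in zip(parts[1::2], parts[2::2]):
        # Drop remaining special tokens and surrounding whitespace
        text = OTHER_TOKEN.sub("", text).strip()
        if text:  # skip tags that carry no text
            segments.append((lang, text))
    return segments


# Shortened, illustrative string in the same shape as the output above
example = "<|de|> Wünsche ihnen dem Frieden.<|fr|> Nous allons construire.<|es|> Esta es una imagen."
for lang, text in split_by_language(example):
    print(f"{lang}: {text}")
# de: Wünsche ihnen dem Frieden.
# fr: Nous allons construire.
# es: Esta es una imagen.
```

Tags that are followed by no text are skipped, so a leading detection token with nothing after it does not produce an empty segment.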

## Evaluation

### English

| Model                                      | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
| ------------------------------------------ | ------------ | ----- | --------- | ------ | ----------------- | ---------------------- | ---------------------- | ---------------------- |
| openai/whisper-large-v3                    | 10.58        | 10.13 | 8.93      | 5.72   | 2.95              | 1.87                   | 3.58                   | 10.07                  |
| openai/whisper-large-v3-turbo              | 10.20        | 11.74 | 11.78     | 6.13   | 2.95              | 1.98                   | 3.94                   | 10.11                  |
| distil-whisper/distil-large-v3             | 8.93         | 12.41 | 7.72      | 7.59   | 3.25              | 2.42                   | 5.11                   | 10.08                  |
| distil-whisper/distil-large-v3.5           | 8.65         | 11.07 | 7.54      | 6.74   | 2.86              | 2.28                   | 4.94                   | 9.84                   |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88         | 11.33 | 7.60      | 6.97   | 3.03              | 2.51                   | 5.24                   | 10.12                  |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36         | 11.32 | 7.65      | 7.02   | 2.99              | 2.46                   | 5.24                   | 10.06                  |

### French

| Model                                       | mcv17 | mls  | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------- | ----- | ---- | --------- | ----- | ----------- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3                     | 10.98 | 4.69 | 11.15     | 8.67  | 7.51        | 5.4    | 9.87                | 8.97                   | 9             | 8.01             |
| openai/whisper-large-v3-turbo               | 12.41 | 5.1  | 12.21     | 9.87  | 8.37        | 5.48   | 10.12               | 9                      | 8.49          | 8.39             |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1  | 5    | 10.68     | 8.75  | 7.09        | 6.35   | 9.44                | 9.84                   | 8.94          | 8.93             |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2  | 11.96 | 6.04 | 11.07     | 9.16  | 7.99        | 7.10   | 10.42               | 12.61                  | 9.06          | 11.75            |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2  | 12.19 | 6.2  | 11.29     | 9.13  | 8.26        | 7.17   | 10.04               | 12.26                  | 8.93          | 11.56            |

### Spanish

| Model                                      | mcv17 | mls  | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------ | ----- | ---- | --------- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3                    | 4.91  | 3.97 | 11.06     | 6.52  | 4.22   | 10.85               | 10.36                  | 5.90          | 5.22             |
| openai/whisper-large-v3-turbo              | 5.74  | 4.41 | 16.02     | 6.66  | 4.59   | 11.55               | 10.68                  | 6.46          | 5.41             |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58  | 4.34 | 8.52      | 7.43  | 5.20   | 11.26               | 13.43                  | 5.69          | 8.95             |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70  | 4.35 | 8.55      | 7.56  | 5.15   | 11.45               | 13.54                  | 5.84          | 8.27             |

### German

| Model                                      | mcv17 | mls  | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ------------------------------------------ | ----- | ---- | --------- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3                    | 6.11  | 5.60 | 17.75     | 19.63 | 5.92   | 11.21               | 10.35                  | 17.64         | 17.76            |
| openai/whisper-large-v3-turbo              | 7.45  | 6.43 | 20.48     | 20.00 | 6.45   | 10.57               | 9.70                   | 18.04         | 18.37            |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31  | 6.45 | 12.41     | 21.48 | 8.20   | 11.04               | 13.55                  | 19.54         | 21.76            |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57  | 6.67 | 12.42     | 21.95 | 8.28   | 11.21               | 13.84                  | 19.90         | 21.67            |