---
language:
- en
- fr
- es
- de
license: mit
library_name: transformers
tags:
- audio
- automatic-speech-recognition
- transformers.js
widget:
- example_title: LibriSpeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
---

# Whisper-Large-V3-Distil-Multi4-v0.2

A multilingual distilled Whisper model with 2 decoder layers, supporting 4 European languages: English, French, Spanish, and German.

The model was trained during my work on [Distil-Large-v3.5](https://huggingface.co/distil-whisper/distil-large-v3.5).

A notable feature is its native support for **code-switching**. The model can switch languages within a single transcribed segment: when it detects a language change, it automatically emits a new language token (as demonstrated in the usage example).

*The `<|yue|>` language token has been repurposed during training to act as an automatic language detection token that enables code-switching during inference. To use this feature, simply set the language parameter to `cantonese` (used by default).*

The model's performance is below both the monolingual distilled version and Whisper-Large-v3-Turbo. Future work should investigate better training procedures and possibly incorporate more data to effectively compress multilingual capabilities into a single model.
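Because the model emits a language token at each detected language change, a decoded string that keeps special tokens can be split into per-language segments. The following is a minimal sketch of that post-processing step, not part of the model's own tooling. `split_by_language` is a hypothetical helper, and it assumes that language tokens look like `<|xx|>` with two or three lowercase letters and that every other `<|...|>` token can be discarded.

```python
import re
from typing import List, Tuple

# Assumption: language tokens are 2-3 lowercase letters, e.g. <|de|> or <|yue|>.
LANG_TOKEN = re.compile(r"<\|([a-z]{2,3})\|>")
# Any remaining special token, e.g. <|transcribe|> or <|endoftext|>.
OTHER_TOKEN = re.compile(r"<\|[^|]*\|>")


def split_by_language(decoded: str) -> List[Tuple[str, str]]:
    """Split decoded text that still contains language tokens into (language, text) pairs."""
    matches = list(LANG_TOKEN.finditer(decoded))
    segments = []
    for i, match in enumerate(matches):
        # A segment runs from the end of this language token to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(decoded)
        text = OTHER_TOKEN.sub("", decoded[match.end():end]).strip()
        if text:
            segments.append((match.group(1), text))
    return segments


example = "<|de|> Wünsche ihnen dem Frieden.<|fr|> Nous allons construire.<|es|> Esta es una imagen."
for language, text in split_by_language(example):
    print(language, "->", text)
```

The helper expects the string returned by `processor.batch_decode(..., skip_special_tokens=False)`; text that precedes the first language token is ignored.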

## Table of Contents

- [Usage](#usage)
- [Evaluation](#evaluation)

## Usage

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi4-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]

# Ground truth text
print(text)
# Aber sei ihnen nicht böse, Habibi, vergib ihnen, sie vergaßen die Liebe, sie vergaßen die Bibel,
# wünsch ihnen den Frieden. Nous allons construire des radiotélescopes géants comme celui-ci,
# qui est mon préféré. Questa è un'immagine di Cairo Open City, una mostra che il museo Folkwang di
# Essen ha dedicato al ruolo della mobile photography nella primavera Araba.

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate tokens
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
# Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe.
# Wünsche ihnen dem Frieden. Nous allons construire des radiotelescopes géants, comme celui-ci qui
# est mon préféré. Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.

# Inspect the generated tokens, keeping the language tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
# <|de|> Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe.
# Wünsche ihnen dem Frieden.<|fr|> Nous allons construire des radiotelescopes géants, comme celui-ci qui
# est mon préféré.<|es|> Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.
```

## Evaluation

### English

| Model | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
| ----- | ------------ | ----- | --------- | ------ | ----------------- | ---------------------- | ---------------------- | ---------------------- |
| openai/whisper-large-v3 | 10.58 | 10.13 | 8.93 | 5.72 | 2.95 | 1.87 | 3.58 | 10.07 |
| openai/whisper-large-v3-turbo | 10.20 | 11.74 | 11.78 | 6.13 | 2.95 | 1.98 | 3.94 | 10.11 |
| distil-whisper/distil-large-v3 | 8.93 | 12.41 | 7.72 | 7.59 | 3.25 | 2.42 | 5.11 | 10.08 |
| distil-whisper/distil-large-v3.5 | 8.65 | 11.07 | 7.54 | 6.74 | 2.86 | 2.28 | 4.94 | 9.84 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88 | 11.33 | 7.60 | 6.97 | 3.03 | 2.51 | 5.24 | 10.12 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36 | 11.32 | 7.65 | 7.02 | 2.99 | 2.46 | 5.24 | 10.06 |

### French

| Model | mcv17 | mls | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ----- | ----- | --- | --------- | ----- | ----------- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 10.98 | 4.69 | 11.15 | 8.67 | 7.51 | 5.4 | 9.87 | 8.97 | 9 | 8.01 |
| openai/whisper-large-v3-turbo | 12.41 | 5.1 | 12.21 | 9.87 | 8.37 | 5.48 | 10.12 | 9 | 8.49 | 8.39 |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1 | 5 | 10.68 | 8.75 | 7.09 | 6.35 | 9.44 | 9.84 | 8.94 | 8.93 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 11.96 | 6.04 | 11.07 | 9.16 | 7.99 | 7.10 | 10.42 | 12.61 | 9.06 | 11.75 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 12.19 | 6.2 | 11.29 | 9.13 | 8.26 | 7.17 | 10.04 | 12.26 | 8.93 | 11.56 |

### Spanish

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ----- | ----- | --- | --------- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 4.91 | 3.97 | 11.06 | 6.52 | 4.22 | 10.85 | 10.36 | 5.90 | 5.22 |
| openai/whisper-large-v3-turbo | 5.74 | 4.41 | 16.02 | 6.66 | 4.59 | 11.55 | 10.68 | 6.46 | 5.41 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58 | 4.34 | 8.52 | 7.43 | 5.20 | 11.26 | 13.43 | 5.69 | 8.95 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70 | 4.35 | 8.55 | 7.56 | 5.15 | 11.45 | 13.54 | 5.84 | 8.27 |

### German

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
| ----- | ----- | --- | --------- | ----- | ------ | ------------------- | ---------------------- | ------------- | ---------------- |
| openai/whisper-large-v3 | 6.11 | 5.60 | 17.75 | 19.63 | 5.92 | 11.21 | 10.35 | 17.64 | 17.76 |
| openai/whisper-large-v3-turbo | 7.45 | 6.43 | 20.48 | 20.00 | 6.45 | 10.57 | 9.70 | 18.04 | 18.37 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31 | 6.45 | 12.41 | 21.48 | 8.20 | 11.04 | 13.55 | 19.54 | 21.76 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57 | 6.67 | 12.42 | 21.95 | 8.28 | 11.21 | 13.84 | 19.90 | 21.67 |
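
To read the tables more easily, the per-dataset gap between two models can be expressed as a relative change. The sketch below does this for two rows copied from the German table. It assumes the scores are error rates in percent where lower is better, which the tables do not state explicitly; `relative_change` and `compare` are hypothetical helper names.

```python
# Scores copied from the German table (assumed to be error rates in %, lower is better).
DATASETS = [
    "mcv17", "mls", "voxpopuli", "mtedx", "fleurs",
    "hf_dev_data_chunk30", "hf_dev_data_sequential",
    "mtedx_chunk30", "mtedx_sequential",
]
SCORES = {
    "openai/whisper-large-v3":
        [6.11, 5.60, 17.75, 19.63, 5.92, 11.21, 10.35, 17.64, 17.76],
    "bofenghuang/whisper-large-v3-distil-multi4-v0.2":
        [7.31, 6.45, 12.41, 21.48, 8.20, 11.04, 13.55, 19.54, 21.76],
}


def relative_change(baseline: float, candidate: float) -> float:
    """Signed relative change in percent; negative means the candidate scores lower."""
    return 100.0 * (candidate - baseline) / baseline


def compare(baseline_name: str, candidate_name: str) -> dict:
    """Map each dataset to the candidate's relative change against the baseline."""
    pairs = zip(DATASETS, SCORES[baseline_name], SCORES[candidate_name])
    return {name: relative_change(base, cand) for name, base, cand in pairs}


result = compare(
    "openai/whisper-large-v3",
    "bofenghuang/whisper-large-v3-distil-multi4-v0.2",
)
for name, change in result.items():
    print(f"{name:24s} {change:+.1f}%")
```

A relative change is used rather than an absolute difference so that datasets with very different baseline scores remain comparable.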