---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---

# Gemma 3 MM model card

**Terms of Use**: [Terms][terms]

[terms]: https://ai.google.dev/gemma/terms

## Model Summary

**Gemma-3-MM** is an open multimodal instruction model that extends the capabilities of the original Gemma-3 models to **include speech processing.** It leverages the language and vision research used in the original Gemma-3 models and incorporates **additional speech processing capabilities** through a Speech Adapter. The model can process text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length (32K for the 1B model).

## Evaluation

Model evaluation metrics and results. The [evaluation script][Script] used to produce them is available.

[Korean Branch]: https://huggingface.co/junnei/gemma-3-4b-it-speech/tree/korean
[Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
[Covost2]: https://huggingface.co/datasets/junnei/covost2
[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2
[LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
[Fleurs]: https://huggingface.co/datasets/google/fleurs
[Zeroth]: https://huggingface.co/datasets/Bingsu/zeroth-korean
[Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
[Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
[Link3]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_clean_en_us_to_ko_kr.json
[Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
[Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
[Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
[Link7]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Zeroth_ko_kr_to_en_us.json
[Link8]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Fleurs_ko_kr_to_en_us.json
[Link9]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_CoVoST_ko_kr_to_en_us.json

### ASR

| Benchmark                        | Task          | BLEU ↑    | CER ↓    | WER ↓    | Result        |
| -------------------------------- |---------------|:---------:|:--------:|:--------:|:-------------:|
| [Covost2][Covost2]               | ASR (English) | **86.09** | **4.12** | **7.83** | [Link][Link1] |
| [Fleurs][Fleurs]                 | ASR (English) | **89.61** | **2.28** | **5.23** | [Link][Link2] |
| [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] |
| [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] |

### AST

| Benchmark          | Task                         | BLEU ↑ | Result        |
| ------------------ |------------------------------|:------:|:-------------:|
| [Covost2][Covost2] | AST (0-shot, English-Korean) | 31.55  | [Link][Link5] |
| [Fleurs][Fleurs]   | AST (0-shot, English-Korean) | 11.05  | [Link][Link6] |

#### (Experimental) ASR: [Korean Branch][Korean Branch]

Scores are lower because a Korean text normalizer is not applied.

| Benchmark          | Task         | BLEU ↑    | CER ↓    | WER ↓    | Result        |
| ------------------ |--------------|:---------:|:--------:|:--------:|:-------------:|
| [Zeroth][Zeroth]   | ASR (Korean) | **94.91** | **1.31** | **2.50** | [Link][Link7] |
| [Fleurs][Fleurs]   | ASR (Korean) | **62.83** | **9.08** | **23.0** | [Link][Link8] |
| [Covost2][Covost2] | ASR (Korean) | **43.66** | **22.5** | **41.4** | [Link][Link9] |
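For reference, metrics like those above can be computed with the Hugging Face `evaluate` library. The snippet below is a minimal, illustrative sketch only: the reference/prediction lists are placeholders, and the linked evaluation script may apply additional text normalization before scoring.

```python
# Illustrative metric computation (not the exact evaluation script).
# Requires: pip install evaluate jiwer sacrebleu
import evaluate

references = ["the cat sat on the mat"]   # ground-truth transcripts (placeholder)
predictions = ["the cat sat on a mat"]    # model outputs (placeholder)

wer = evaluate.load("wer").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=[[r] for r in references]
)["score"]

print(f"WER: {wer * 100:.2f}  CER: {cer * 100:.2f}  BLEU: {bleu:.2f}")
```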
## Model Details

[junnei]: https://huggingface.co/junnei

- **Developed by:** [junnei][junnei]
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** [Gemma](https://ai.google.dev/gemma/terms)
- **Base model:** [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Inspiration:** [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)

## Training Details

- The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
- Due to limited computational resources, the model was **trained on a limited set of datasets for a small number of epochs** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, using a single A100 GPU.
- The training data was limited to **English and Korean** audio clips **shorter than 30 seconds.**

## Datasets

### ASR / AST

- [Covost2 Dataset][Covost2] / [No Download Version][Covost2-ko]
- [LibriSpeech][LibriSpeech]
- [Fleurs][Fleurs]
- [Zeroth][Zeroth]

## Limitations

Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:

- More computational resources are needed for extended training.
- For now, the model only supports Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
- Due to the lack of computing resources, the model **primarily recognizes audio clips shorter than 30 seconds.** As a result, accuracy may drop significantly for longer audio inputs.
- If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.

### Usage

Below are some code snippets to help you get started with the model. First, upgrade your Transformers library; audio input in `chat_template` is now supported.

```sh
$ pip install -U transformers
```

Then, copy the snippet from the section that is relevant for your use case.

#### Running the model with chat_template

```python
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"  # or "korean" for the experimental Korean branch

model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)  # What is shown in this image?
```
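Since the vision capabilities of the base model are retained (see Limitations), image prompts should also work through the same `chat_template` interface. The sketch below is illustrative only: it assumes this checkpoint keeps the base Gemma-3 `{"type": "image", ...}` message format, and the image URL is just an example.

```python
# Vision-language prompt (illustrative sketch; assumes the base Gemma-3
# image message format is preserved by this checkpoint).
# Reuses `model` and `processor` from the snippet above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.JPG"},  # example image URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```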
#### Running the model with raw data

```python
from io import BytesIO
from urllib.request import urlopen

import soundfile

# `model`, `processor`, and `torch` are loaded as in the previous snippet.

# Get audio data from a URL.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

audio_token = ''  # prepended to the text prompt; left empty here

messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
```

### Finetune the model

[Finetune]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/finetune_speech.py

The finetuning script is available here: [Link][Finetune]

**You must change `output_dir` and `upload_dir`, and adapt the script to your datasets.**

```bash
python finetune_speech.py
```

### Citation

```none
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}
```