---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---

# Gemma 3 MM model card

**Terms of Use**: [Terms][terms]

[terms]: https://ai.google.dev/gemma/terms

## Model Summary

**Gemma-3-MM** is an open multimodal instruction model that extends the capabilities of the original Gemma-3 models to **include speech processing.** It leverages the language and vision research used in the original Gemma-3 models and incorporates **additional speech processing capabilities** through a Speech Adapter. The model can process text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length (32K for the 1B model).

## Evaluation

Model evaluation metrics and results. The [evaluation script][Script] used to produce them is available.

[Korean Branch]: https://huggingface.co/junnei/gemma-3-4b-it-speech/tree/korean
[Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
[Covost2]: https://huggingface.co/datasets/junnei/covost2
[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2
[LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
[Fleurs]: https://huggingface.co/datasets/google/fleurs
[Zeroth]: https://huggingface.co/datasets/Bingsu/zeroth-korean
[Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
[Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
[Link3]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_clean_en_us_to_ko_kr.json
[Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
[Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
[Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
[Link7]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Zeroth_ko_kr_to_en_us.json
[Link8]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Fleurs_ko_kr_to_en_us.json
[Link9]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_CoVoST_ko_kr_to_en_us.json

### ASR

| Benchmark                        | Task          | BLEU ↑    | CER ↓    | WER ↓    | Result        |
| -------------------------------- |---------------|:---------:|:--------:|:--------:|:-------------:|
| [Covost2][Covost2]               | ASR (English) | **86.09** | **4.12** | **7.83** | [Link][Link1] |
| [Fleurs][Fleurs]                 | ASR (English) | **89.61** | **2.28** | **5.23** | [Link][Link2] |
| [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] |
| [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] |

### AST

| Benchmark          | Task                         | BLEU ↑ | Result        |
| ------------------ |------------------------------|:------:|:-------------:|
| [Covost2][Covost2] | AST (0-shot, English-Korean) | 31.55  | [Link][Link5] |
| [Fleurs][Fleurs]   | AST (0-shot, English-Korean) | 11.05  | [Link][Link6] |

#### (Experimental) ASR: [Korean Branch][Korean Branch]

Scores are lower because a Korean text normalizer is not applied.

| Benchmark          | Task         | BLEU ↑    | CER ↓    | WER ↓    | Result        |
| ------------------ |--------------|:---------:|:--------:|:--------:|:-------------:|
| [Zeroth][Zeroth]   | ASR (Korean) | **94.91** | **1.31** | **2.50** | [Link][Link7] |
| [Fleurs][Fleurs]   | ASR (Korean) | **62.83** | **9.08** | **23.0** | [Link][Link8] |
| [Covost2][Covost2] | ASR (Korean) | **43.66** | **22.5** | **41.4** | [Link][Link9] |
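For reference, metrics like those above can be computed with the Hugging Face `evaluate` library. The snippet below is a minimal, illustrative sketch only: the reference/prediction lists are placeholders, and the linked evaluation script may apply additional text normalization before scoring.

```python
# Illustrative metric computation (not the exact evaluation script).
# Requires: pip install evaluate jiwer sacrebleu
import evaluate

references = ["the cat sat on the mat"]   # ground-truth transcripts (placeholder)
predictions = ["the cat sat on a mat"]    # model outputs (placeholder)

wer = evaluate.load("wer").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=[[r] for r in references]
)["score"]

print(f"WER: {wer * 100:.2f}  CER: {cer * 100:.2f}  BLEU: {bleu:.2f}")
```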
## Model Details

[junnei]: https://huggingface.co/junnei

- **Developed by:** [junnei][junnei]
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** [Gemma](https://ai.google.dev/gemma/terms)
- **Base model:** [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Inspiration:** [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)

## Training Details

- The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
- Due to limited computational resources, the model was **trained on a limited set of datasets for a small number of epochs** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, using a single A100 GPU.
- The training data was limited to **English and Korean** audio clips **shorter than 30 seconds.**

## Datasets

### ASR / AST

- [Covost2 Dataset][Covost2] / [No Download Version][Covost2-ko]
- [LibriSpeech][LibriSpeech]
- [Fleurs][Fleurs]
- [Zeroth][Zeroth]

## Limitations

Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:

- More computational resources are needed for extended training.
- For now, the model only supports Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
- Due to the lack of computing resources, the model **primarily recognizes audio clips shorter than 30 seconds.** As a result, accuracy may drop significantly for longer audio inputs.
- If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.

### Usage

Below are some code snippets to help you get started with the model. First, upgrade your Transformers library; audio input in `chat_template` is now supported.

```sh
$ pip install -U transformers
```

Then, copy the snippet from the section that is relevant for your use case.

#### Running the model with chat_template

```python
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"  # or "korean" for the experimental Korean branch

model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)  # What is shown in this image?
```
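Since the vision capabilities of the base model are retained (see Limitations), image prompts should also work through the same `chat_template` interface. The sketch below is illustrative only: it assumes this checkpoint keeps the base Gemma-3 `{"type": "image", ...}` message format, and the image URL is just an example.

```python
# Vision-language prompt (illustrative sketch; assumes the base Gemma-3
# image message format is preserved by this checkpoint).
# Reuses `model` and `processor` from the snippet above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.JPG"},  # example image URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```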
#### Running the model with raw data

```python
from io import BytesIO
from urllib.request import urlopen

import soundfile

# `model`, `processor`, and `torch` are loaded as in the previous snippet.

# Get audio data from a URL.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

audio_token = ''  # prepended to the text prompt; left empty here

messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
```

### Finetune the model

[Finetune]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/finetune_speech.py

The finetuning script is available here: [Link][Finetune]

**You must change `output_dir` and `upload_dir`, and adapt the script to your datasets.**

```bash
python finetune_speech.py
```

### Citation

```none
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}
```