junnei committed (verified)
Commit ab0b2ea · 1 Parent(s): ede5c16

Update README.md

Files changed (1): README.md (+11 −13)
README.md CHANGED
@@ -29,10 +29,10 @@ capabilities** through a Speech Adapter.

  The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).

- The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
+ ~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
  with Gemma-3-4b-it-speech being a specific instance of this family.
  These models maintain the original Gemma-3 capabilities while adding
- multilingual speech recognition and translation abilities.
+ multilingual speech recognition and translation abilities.~~

  ## Evaluation

@@ -56,17 +56,18 @@ Model evaluation metrics and results.

  ## Model Details

- Developed by: [junnei]()
+ [junnei]: https://huggingface.co/junnei
+ Developed by: [junnei][junnei]

  Model type: Multimodal (Text, Vision, Speech) Language Model

  Language(s): Multilingual

- License: [Gemma]()
+ License: [Gemma](https://ai.google.dev/gemma/terms)

- Base model: [google/gemma-3-4b-it]
+ Base model: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)

- Inspiration: [Phi-4-multimodal-instruct]
+ Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)

  ## Training Details

@@ -74,7 +75,7 @@ Inspiration: [Phi-4-multimodal-instruct]

  - Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks with A100 1 GPU in 12 hours.

- - The training data was limited to **English and Korean languages** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2) within **2-15 seconds in duration.**
+ - The training data was limited to **English and Korean languages** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2) within **less than 30 seconds in duration.**

  ## Limitations

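Editor's note: the duration constraint described in the training-data bullet above can be reproduced with a simple filter. The sketch below is illustrative and not part of this commit; it assumes the junnei/covost2 dataset exposes a standard Hugging Face `audio` column (with `array` and `sampling_rate`), and the config and split names are placeholders.

```python
# Illustrative sketch only (not from this commit): keeping clips shorter than
# 30 seconds, matching the training-data constraint described above.
from datasets import load_dataset

# Config/split names are assumptions; the dataset's actual schema may differ.
ds = load_dataset("junnei/covost2", "en_ko", split="train")

def under_30s(example):
    audio = example["audio"]  # assumes a standard HF Audio column
    return len(audio["array"]) / audio["sampling_rate"] < 30.0

ds = ds.filter(under_30s)
print(f"{len(ds)} clips under 30 seconds")
```
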
@@ -86,7 +87,7 @@ To improve the model's performance and reliability, the following areas need fur
  - For now, the model only works for Vision-Language tasks and **Audio-Language tasks (ASR/AST).**

  - Due to the lack of computing resources,
- this model **primarily recognizes audio files within 2-15 seconds** in duration.
+ this model **primarily recognizes audio files less than 30 seconds** in duration.
  As a result, there is a limitation where the accuracy may drop significantly for longer audio inputs.

  - If possible, We will train the model for Speech-Vision Tasks and more Audio-Language tasks.
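
Editor's note: since the limitation above concerns recordings longer than roughly 30 seconds, one possible workaround is to split long audio into shorter chunks before transcription and join the partial results. The sketch below is illustrative and not part of this commit; `transcribe_chunk` is a hypothetical stand-in for the model call shown in the README usage examples.

```python
# Illustrative sketch only (not from this commit): chunking audio that exceeds
# the ~30-second limit noted above before passing it to the model.
import soundfile as sf

def split_into_chunks(path, max_seconds=30.0):
    audio, sr = sf.read(path)               # decode the full recording
    chunk_len = int(max_seconds * sr)       # samples per chunk
    return [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)], sr

chunks, sr = split_into_chunks("long_recording.wav")  # placeholder filename
# transcript = " ".join(transcribe_chunk(chunk, sr) for chunk in chunks)
# transcribe_chunk() is hypothetical; it stands in for the README's generate call.
```
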
@@ -146,7 +147,7 @@ print(response)
  ```


- #### Running the model with local data
+ #### Running the model with raw data

  ```python
  from io import BytesIO
@@ -181,16 +182,13 @@ with torch.inference_mode():
  print(response)
  ```

- ## Usage and Limitations
-
- These models have certain limitations that users should be aware of.

  ### Citation

  ```none
  @article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
- author={[junnei]},
+ author={Seongjun Jang},
  year={2025}
  }

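Editor's note: the body of the renamed "Running the model with raw data" example is elided by the hunks above. The snippet below shows how raw bytes might be decoded into an array before being handed to the model's processor. It is an illustrative sketch, not part of this commit; the URL is a placeholder and only the byte-decoding step is shown.

```python
# Illustrative sketch only (not from this commit): decoding raw audio bytes,
# matching the BytesIO import visible in the hunk above.
from io import BytesIO
import urllib.request

import soundfile as sf

url = "https://example.com/sample.wav"  # placeholder URL
with urllib.request.urlopen(url) as resp:
    audio_bytes = BytesIO(resp.read())

audio_array, sampling_rate = sf.read(audio_bytes)
# audio_array and sampling_rate would then be passed to the model's processor,
# as in the (elided) README usage example above.
```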
 