The models process text, image, and audio inputs, generate text outputs, and come with a 128K-token context length (32K for the 1B model).

~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters), with Gemma-3-4b-it-speech being a specific instance of this family. These models maintain the original Gemma-3 capabilities while adding multilingual speech recognition and translation abilities.~~

## Evaluation
Model evaluation metrics and results.

## Model Details

[junnei]: https://huggingface.co/junnei

Developed by: [junnei][junnei]

Model type: Multimodal (Text, Vision, Speech) Language Model

Language(s): Multilingual

License: [Gemma](https://ai.google.dev/gemma/terms)

Base model: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)

Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)

## Training Details
- Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU in 12 hours.

- The training data was limited to **English and Korean** clips from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), each **less than 30 seconds in duration** (a sketch of this duration filter follows).
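As a rough illustration of that duration cutoff, the snippet below filters a Hugging Face audio dataset to clips under 30 seconds. It is a minimal sketch, not the actual training code: the `en_ko` config name and `train` split are assumptions, and the dataset is only assumed to follow the standard `audio` column layout.

```python
# Minimal sketch of a <30s duration filter (assumptions: "en_ko" config,
# "train" split, and a standard HF "audio" column with array + sampling_rate).
from datasets import load_dataset

ds = load_dataset("junnei/covost2", "en_ko", split="train")  # names assumed

def under_30s(example):
    audio = example["audio"]
    # Duration in seconds = number of samples / samples per second.
    return len(audio["array"]) / audio["sampling_rate"] < 30.0

ds_short = ds.filter(under_30s)
print(f"kept {len(ds_short)} of {len(ds)} clips")
```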
## Limitations
To improve the model's performance and reliability, the following areas need further improvement.

- For now, the model only works for Vision-Language tasks and **Audio-Language tasks (ASR/AST).**

- Due to the lack of computing resources, this model **primarily recognizes audio files less than 30 seconds** in duration. As a result, accuracy may drop significantly for longer audio inputs (a chunking workaround is sketched after this list).

- If possible, we will train the model for Speech-Vision tasks and more Audio-Language tasks.
```python
# ... (earlier usage example elided in this excerpt)
print(response)
```

#### Running the model with raw data

```python
from io import BytesIO
# ...

with torch.inference_mode():
    # ...

print(response)
```
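Because the block above is excerpted, here is a self-contained sketch of the same flow under explicit assumptions: the checkpoint is loaded via Transformers with `trust_remote_code=True`, and the processor accepts `text=` and `audio=` keywords. Those entry points and keyword names are assumptions, not the model's documented API; check the full usage examples for the exact calls.

```python
# Self-contained sketch of raw-audio inference (assumed interface: Transformers
# AutoProcessor / AutoModelForCausalLM with trust_remote_code, and a processor
# that accepts text= and audio= keywords; verify against the examples above).
from io import BytesIO
from urllib.request import urlopen

import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "junnei/gemma-3-4b-it-speech"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# Fetch raw bytes (placeholder URL) and decode to a float waveform.
audio_bytes = urlopen("https://example.com/sample.wav").read()  # placeholder
samples, sampling_rate = sf.read(BytesIO(audio_bytes))

inputs = processor(
    text="Transcribe this audio.",
    audio=[(samples, sampling_rate)],  # keyword name assumed
    return_tensors="pt",
)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)
```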
### Citation

```none
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}
```