Update README.md
README.md CHANGED
@@ -1,3 +1,56 @@
Removed the placeholder front matter stub (`---`, `license:`). The updated README:
---
license: apache-2.0
language:
- zh
- en
library_name: transformers
tags:
- qwen2.5
- audio
- open-source
- thinker
pipeline_tag: text-generation
model_type: qwen2_5_omni
base_model: Qwen/Qwen2.5-Omni-7B
---

# AudioOnlyThinker

This model is a lightweight variant of [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B), customized to **remove the vision encoder** and support only **audio and text**.

It is intended for audio-to-text instruction following, voice chat, and ASR-style tasks, and it supports generation through `generate()` as with any decoder-only model.

## 🔧 How this model was built

We extracted only the `Thinker` component from the full Qwen2.5-Omni model:

- ✅ Kept: Audio encoder (`audio_tower`) + Language model (`model`)
- ❌ Removed: Vision encoder (`visual`) + Talker (speech decoder)
- ✅ Manually deleted `vision_config` from `config.json`
- ✅ Class modified by subclassing `Qwen2_5OmniThinkerForConditionalGeneration` (see the sketch below)
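A minimal sketch of that extraction, assuming the `transformers` Qwen2.5-Omni classes and the attribute names listed above; the actual conversion script is not published here, so treat paths and details as illustrative:

```python
# Illustrative sketch only: the real conversion script is not published.
from transformers import Qwen2_5OmniForConditionalGeneration

# Load the full Omni model (Thinker + Talker).
omni = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Keep only the Thinker (audio_tower + language model); the Talker is
# dropped simply by never saving it.
thinker = omni.thinker

# Remove the vision encoder and its config entry, mirroring the manual
# deletion of `vision_config` from config.json described above.
del thinker.visual
thinker.config.vision_config = None

thinker.save_pretrained("AudioOnlyThinker")  # output path is illustrative
```

Because only the Thinker is saved, the resulting checkpoint loads without any Talker or vision weights.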
## 📦 Usage

```python
from transformers import AutoModelForCausalLM, Qwen2_5OmniProcessor

model = AutoModelForCausalLM.from_pretrained("chunhuizng/AudioOnlyThinker")
processor = Qwen2_5OmniProcessor.from_pretrained("chunhuizng/AudioOnlyThinker")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "example.wav"},
            {"type": "text", "text": "What is being said in this audio?"}
        ]
    }
]

# return_dict=True makes apply_chat_template return a dict that can be
# unpacked into generate(); add_generation_prompt=True appends the
# assistant turn marker so the model starts a reply.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
```
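Note that `outputs` contains the prompt tokens followed by the generated ones, so the decoded `response` above echoes the prompt. An optional variant of the last line that decodes only the reply:

```python
# Slice off the prompt tokens so only the newly generated text is decoded.
reply_ids = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(reply_ids, skip_special_tokens=True)[0]
```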