chunhuizng committed (verified) · Commit de3ce3c · Parent(s): 66fa2b5

Update README.md

Files changed (1): README.md (+56 −3)
---
license: apache-2.0
language:
- zh
- en
library_name: transformers
tags:
- qwen2.5
- audio
- open-source
- thinker
pipeline_tag: text-generation
model_type: qwen2_5_omni
base_model: Qwen/Qwen2.5-Omni-7B
---

# AudioOnlyThinker

This model is a lightweight variant of [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B), customized to **remove the vision encoder** and support only **audio and text**.

It is intended for audio-to-text instruction following, voice chat, and ASR-style tasks, and supports generation through `generate()` as with any decoder-only model.

## 🔧 How this model was built

We extracted only the `Thinker` component from the full Qwen2.5-Omni model:

- ✅ Kept: audio encoder (`audio_tower`) + language model (`model`)
- ❌ Removed: vision encoder (`visual`) + Talker (speech decoder)
- ✅ Manually deleted `vision_config` from `config.json`
- ✅ Class modified by subclassing `Qwen2_5OmniThinkerForConditionalGeneration`

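The `config.json` edit above can be sketched as a small helper. This is a minimal illustration, not the exact script used to build the model; the helper name and the file path in the usage comment are our own:

```python
import json


def strip_vision_config(cfg: dict) -> dict:
    """Return a copy of a model config dict with its vision section removed."""
    cfg = dict(cfg)  # shallow copy so the original dict is untouched
    cfg.pop("vision_config", None)  # drop the vision settings if present
    return cfg


# Applying it to a saved config (path illustrative):
# with open("config.json") as f:
#     cfg = json.load(f)
# with open("config.json", "w") as f:
#     json.dump(strip_vision_config(cfg), f, indent=2)
```

All other keys (audio, text, tokenizer settings) pass through unchanged, so the resulting config still loads with the audio-and-text subclass.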
## 📦 Usage

```python
from transformers import AutoModelForCausalLM, Qwen2_5OmniProcessor

model = AutoModelForCausalLM.from_pretrained("chunhuizng/AudioOnlyThinker")
processor = Qwen2_5OmniProcessor.from_pretrained("chunhuizng/AudioOnlyThinker")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "example.wav"},
            {"type": "text", "text": "What is being said in this audio?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation, tokenize=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
```