Update README.md
README.md
CHANGED
@@ -11,12 +11,12 @@ pipeline_tag: any-to-any
---

# Qwen2.5-Omni
<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

## Overview
### Introduction
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
@@ -600,7 +600,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code in Hugging Face Transformers is still at the pull request stage and has not been merged into the main branch yet, so you may need to build from source with the following commands:
```
pip uninstall transformers
pip install git+https://github.com/BakerBunker/transformers@21dbefaa54e5bf180464696aa70af0bfc7a61d53
pip install accelerate
```
Otherwise, you might encounter the following error:
@@ -613,10 +613,10 @@ We offer a toolkit to help you handle various types of audio and visual input mo
```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-omni-utils[decord] -U
```

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which will fall back to torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) if you want decord to be used when loading video.

### 🤗 Transformers Usage
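Before running the snippets in this section, it can help to confirm that the source build above actually exposes the Qwen2.5-Omni classes. A minimal check, assuming the class names used in the examples below:

```python
import transformers

print(transformers.__version__)

try:
    # These names follow the usage examples below; if the import fails,
    # the pull-request build of Transformers is probably not installed.
    from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor  # noqa: F401
    print("Qwen2.5-Omni classes are available.")
except ImportError:
    print("Qwen2.5-Omni classes not found; install the branch listed above.")
```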
@@ -625,14 +625,14 @@ Here we show a code snippet to show you how to use the chat model with `transfor
```python
import soundfile as sf

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# default: Load the model on the available device(s)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-Omni-7B",
#     torch_dtype="auto",
#     device_map="auto",
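# (continuation outside this hunk) The snippets further down also rely on the paired
# processor having been loaded; a minimal sketch, assuming the same checkpoint name:
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")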
@@ -660,7 +660,7 @@ USE_AUDIO_IN_VIDEO = True
# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
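# (continuation outside this hunk) A sketch of the typical follow-up: generate text plus
# audio, decode the text, and save the waveform with the `soundfile` import above.
# The 24 kHz sample rate is an assumption.
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)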
@@ -687,7 +687,7 @@ Note: The table above presents the theoretical minimum memory requirements for i
</details>

<details>
<summary>Video URL resource usage</summary>

Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend by setting `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
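The backend can also be forced from Python through the environment variable named above. A minimal sketch; setting the variable before the `qwen_omni_utils` import is a conservative assumption about when it is read:

```python
import os

# Force the torchvision backend; use "decord" to force decord instead.
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"

from qwen_omni_utils import process_mm_info  # imported after the variable is set
```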
@@ -774,7 +774,7 @@ USE_AUDIO_IN_VIDEO = True
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch Inference
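# (continuation outside this hunk) A sketch of batched generation; returning only text
# via `return_audio=False` is an assumption made here to keep the batch step simple.
text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)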
@@ -802,7 +802,7 @@ audios, images, videos = process_mm_info(conversations, use_audio_in_video=True)
```
```python
# second place, in model processor
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt",
                   padding=True, use_audio_in_video=True)
```
```python
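# third place, in model generation (continuation outside this hunk; the exact call below is a sketch)
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)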
@@ -813,24 +813,23 @@ It is worth noting that during a multi-round conversation, the `use_audio_in_vid

#### Use audio output or not

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This saves about 2 GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.
```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
model.disable_talker()
```

For a more flexible experience, we recommend that users decide whether to return audio when `generate` is called. If `return_audio` is set to `False`, the model returns only text outputs, which makes text responses faster.

```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
...
text_ids = model.generate(**inputs, return_audio=False)
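# (continuation outside this hunk) With the talker enabled, the same call can also
# return the waveform; a sketch:
text_ids, audio = model.generate(**inputs, return_audio=True)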
@@ -844,14 +843,14 @@ Qwen2.5-Omni supports the ability to change the voice of the output audio. The `
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity.|
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe.|

Users can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the voice type defaults to `Chelsie`.

```python
text_ids, audio = model.generate(**inputs, speaker="Chelsie")
```

```python
text_ids, audio = model.generate(**inputs, speaker="Ethan")
```

#### Flash-Attention 2 to speed up generation
@@ -867,9 +866,9 @@ Also, you should have hardware that is compatible with FlashAttention 2. Read mo
To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

```python
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
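    # (continuation outside this hunk) per the sentence above, the FlashAttention-2 flag
    # goes in this call; note that `torch.bfloat16` also needs `import torch` alongside
    # the transformers import. Sketch:
    attn_implementation="flash_attention_2",
)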