Update README.md
README.md
CHANGED
@@ -11,12 +11,12 @@ pipeline_tag: any-to-any
---

# Qwen2.5-Omni
<a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

## Overview
### Introduction
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
@@ -600,7 +600,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code in Hugging Face Transformers is still at the pull request stage and has not been merged into the main branch yet, so you may need to build from source with the following commands:
```
pip uninstall transformers
pip install git+https://github.com/BakerBunker/transformers@21dbefaa54e5bf180464696aa70af0bfc7a61d53
pip install accelerate
```
Otherwise, you might encounter the following error:
@@ -613,10 +613,10 @@ We offer a toolkit to help you handle various types of audio and visual input mo
```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-omni-utils[decord] -U
```

If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which will fall back to torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) if you want decord to be used when loading video.

### 🤗 Transformers Usage
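Before running the snippets in this section, it can help to confirm that the source build above actually exposes the Qwen2.5-Omni classes. A minimal check, assuming the class names used in the examples below:

```python
import transformers

print(transformers.__version__)

try:
    # These names follow the usage examples below; if the import fails,
    # the pull-request build of Transformers is probably not installed.
    from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor  # noqa: F401
    print("Qwen2.5-Omni classes are available.")
except ImportError:
    print("Qwen2.5-Omni classes not found; install the branch listed above.")
```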
@@ -625,14 +625,14 @@ Here we show a code snippet to show you how to use the chat model with `transfor
```python
import soundfile as sf

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# default: Load the model on the available device(s)
model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-Omni-7B",
#     torch_dtype="auto",
#     device_map="auto",
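# (continuation outside this hunk) The snippets further down also rely on the paired
# processor having been loaded; a minimal sketch, assuming the same checkpoint name:
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")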
@@ -660,7 +660,7 @@ USE_AUDIO_IN_VIDEO = True
# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
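# (continuation outside this hunk) A sketch of the typical follow-up: generate text plus
# audio, decode the text, and save the waveform with the `soundfile` import above.
# The 24 kHz sample rate is an assumption.
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)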
@@ -687,7 +687,7 @@ Note: The table above presents the theoretical minimum memory requirements for i
</details>

<details>
<summary>Video URL resource usage</summary>

Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend by setting `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
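The backend can also be forced from Python through the environment variable named above. A minimal sketch; setting the variable before the `qwen_omni_utils` import is a conservative assumption about when it is read:

```python
import os

# Force the torchvision backend; use "decord" to force decord instead.
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"

from qwen_omni_utils import process_mm_info  # imported after the variable is set
```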
@@ -774,7 +774,7 @@ USE_AUDIO_IN_VIDEO = True
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch Inference
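# (continuation outside this hunk) A sketch of batched generation; returning only text
# via `return_audio=False` is an assumption made here to keep the batch step simple.
text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)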
@@ -802,7 +802,7 @@ audios, images, videos = process_mm_info(conversations, use_audio_in_video=True)
```
```python
# second place, in model processor
inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt",
                   padding=True, use_audio_in_video=True)
```
```python
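# third place, in model generation (continuation outside this hunk; the exact call below is a sketch)
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)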
@@ -813,24 +813,23 @@ It is worth noting that during a multi-round conversation, the `use_audio_in_vid

#### Use audio output or not

The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This saves about 2 GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.
```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
model.disable_talker()
```

For a more flexible experience, we recommend that users decide whether to return audio when `generate` is called. If `return_audio` is set to `False`, the model returns only text outputs, which makes text responses faster.

```python
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto"
)
...
text_ids = model.generate(**inputs, return_audio=False)
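# (continuation outside this hunk) With the talker enabled, the same call can also
# return the waveform; a sketch:
text_ids, audio = model.generate(**inputs, return_audio=True)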
@@ -844,14 +843,14 @@ Qwen2.5-Omni supports the ability to change the voice of the output audio. The `
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity.|
| Ethan | Male | A bright, upbeat voice with infectious energy and a warm, approachable vibe.|

Users can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the voice type defaults to `Chelsie`.

```python
text_ids, audio = model.generate(**inputs, speaker="Chelsie")
```

```python
text_ids, audio = model.generate(**inputs, speaker="Ethan")
```

#### Flash-Attention 2 to speed up generation
@@ -867,9 +866,9 @@ Also, you should have hardware that is compatible with FlashAttention 2. Read mo
To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

```python
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
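    # (continuation outside this hunk) per the sentence above, the FlashAttention-2 flag
    # goes in this call; note that `torch.bfloat16` also needs `import torch` alongside
    # the transformers import. Sketch:
    attn_implementation="flash_attention_2",
)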