xiongwang committed
Commit dc4921a · verified · 1 Parent(s): 67f8902

Update README.md

Files changed (1)
README.md +24 -25
README.md CHANGED
@@ -11,12 +11,12 @@ pipeline_tag: any-to-any
---

# Qwen2.5-Omni
- <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
+ <a href="https://chat.qwen.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>


- ## OverView
+ ## Overview
### Introduction
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
 
@@ -600,7 +600,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
Below, we provide simple examples showing how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code for Hugging Face Transformers is still at the pull-request stage and has not been merged into the main branch yet, so you may need to build from source with the following commands:
```
pip uninstall transformers
- pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
+ pip install git+https://github.com/BakerBunker/transformers@21dbefaa54e5bf180464696aa70af0bfc7a61d53
pip install accelerate
```
or you might encounter the following error:
@@ -613,10 +613,10 @@ We offer a toolkit to help you handle various types of audio and visual input mo

```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
- pip install qwen-omni-utils[decord]
+ pip install qwen-omni-utils[decord] -U
```

- If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils` which will fall back to using torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) to get decord used when loading video.
+ If you are not using Linux, you might not be able to install `decord` from PyPI. In that case, you can use `pip install qwen-omni-utils -U`, which will fall back to using torchvision for video processing. However, you can still [install decord from source](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source) so that decord is used when loading video.

### 🤗 Transformers Usage
 
@@ -625,14 +625,14 @@ Here we show a code snippet to show you how to use the chat model with `transfor
```python
import soundfile as sf

- from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
+ from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# default: Load the model on the available device(s)
- model = Qwen2_5OmniModel.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
- # model = Qwen2_5OmniModel.from_pretrained(
+ # model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-Omni-7B",
#     torch_dtype="auto",
#     device_map="auto",
@@ -660,7 +660,7 @@ USE_AUDIO_IN_VIDEO = True
# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
- inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
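
The hunks above touch only fragments of the quick-start script. For orientation, here is a consolidated sketch that stitches them together; the conversation layout, the system prompt, and the sample video URL are assumptions taken from the surrounding model card rather than part of this diff:

```python
import soundfile as sf

from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# One video-input turn; the URL is a placeholder for any reachable video.
conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": (
            "You are Qwen, a virtual human developed by the Qwen Team, Alibaba "
            "Group, capable of perceiving auditory and visual inputs, as well "
            "as generating text and speech."
        )}],
    },
    {
        "role": "user",
        "content": [{"type": "video", "video": "https://example.com/video.mp4"}],
    },
]

USE_AUDIO_IN_VIDEO = True  # also use the video's audio track

# Preparation for inference (the same calls as in the hunk above)
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generate the text ids and the speech waveform
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

print(processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```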
@@ -687,7 +687,7 @@ Note: The table above presents the theoretical minimum memory requirements for i
</details>

<details>
- <summary>Video ULR resource usage</summary>
+ <summary>Video URL resource usage</summary>

Video URL compatibility largely depends on the third-party library version. The details are in the table below. Change the backend by `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
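
For example, to pin the reader from inside a script instead of the shell, the variable just needs to be set before any video is processed; a minimal sketch using the variable named above:

```python
import os

# Must be set before the first video is loaded; valid values per the note
# above are "torchvision" and "decord".
os.environ["FORCE_QWENVL_VIDEO_READER"] = "torchvision"

from qwen_omni_utils import process_mm_info  # import after the override
```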
 
@@ -774,7 +774,7 @@ USE_AUDIO_IN_VIDEO = True
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)

- inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Batch Inference
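# Batching note (illustrative, not part of this diff): `conversations` above
# is a list of message lists, one per sample, e.g.
#   conversations = [conversation1, conversation2]
# and with padding=True the processor pads the encoded samples into one batch.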
@@ -802,7 +802,7 @@ audios, images, videos = process_mm_info(conversations, use_audio_in_video=True)
```
```python
# second place, in model processor
- inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt",
+ inputs = processor(text=text, audio=audios, images=images, videos=videos, return_tensors="pt",
                   padding=True, use_audio_in_video=True)
```
```python
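# By the same pattern, the flag is presumably also passed at the third call
# site, model generation (an assumption; that snippet falls outside this hunk):
#   text_ids, audio = model.generate(**inputs, use_audio_in_video=True)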
@@ -813,24 +813,23 @@ It is worth noting that during a multi-round conversation, the `use_audio_in_vid

#### Use audio output or not

- The model supports both text and audio outputs, if users do not need audio outputs, they can set `enable_audio_output=False` in the `from_pretrained` function. This option will save about `~2GB` of GPU memory but the `return_audio` option for `generate` function will only allow to be set at `False`.
+ The model supports both text and audio outputs. If users do not need audio outputs, they can call `model.disable_talker()` after initializing the model. This option saves about 2GB of GPU memory, but the `return_audio` option of the `generate` function can then only be set to `False`.
```python
- model = Qwen2_5OmniModel.from_pretrained(
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
-     device_map="auto",
-     enable_audio_output=False,
+     device_map="auto"
)
+ model.disable_talker()
```

- In order to obtain a flexible experience, we recommend that users set `enable_audio_output` at `True` when initializing the model through `from_pretrained` function, and then decide whether to return audio when `generate` function is called. When `return_audio` is set to `False`, the model will only return text outputs to get text responses faster.
+ For a more flexible experience, we recommend that users decide whether to return audio at the time the `generate` function is called. If `return_audio` is set to `False`, the model returns only text outputs, which makes text responses faster.

```python
- model = Qwen2_5OmniModel.from_pretrained(
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
-     device_map="auto",
-     enable_audio_output=True,
+     device_map="auto"
)
...
text_ids = model.generate(**inputs, return_audio=False)
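# Decoding the returned ids works the same in both modes (sketch, assuming
# the processor loaded in the quick-start example above):
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)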
@@ -844,14 +843,14 @@ Qwen2.5-Omni supports the ability to change the voice of the output audio. The `
| Chelsie | Female | A honeyed, velvety voice that carries a gentle warmth and luminous clarity. |
| Ethan   | Male   | A bright, upbeat voice with infectious energy and a warm, approachable vibe. |

- Users can use the `spk` parameter of `generate` function to specify the voice type. By default, if `spk` is not specified, the default voice type is `Chelsie`.
+ Users can use the `speaker` parameter of the `generate` function to specify the voice type. If `speaker` is not specified, the voice type defaults to `Chelsie`.

```python
- text_ids, audio = model.generate(**inputs, spk="Chelsie")
+ text_ids, audio = model.generate(**inputs, speaker="Chelsie")
```

```python
- text_ids, audio = model.generate(**inputs, spk="Ethan")
+ text_ids, audio = model.generate(**inputs, speaker="Ethan")
```
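
Whichever voice is chosen, the returned waveform can be written to disk with `soundfile`, mirroring the quick-start snippet; a short sketch (the 24 kHz rate is assumed from the model card's examples):

```python
import soundfile as sf

# Flatten the mono waveform and move it to CPU before writing.
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,  # assumed output rate, per the model card examples
)
```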

#### Flash-Attention 2 to speed up generation
@@ -867,9 +866,9 @@ Also, you should have hardware that is compatible with FlashAttention 2. Read mo
To load and run a model using FlashAttention-2, add `attn_implementation="flash_attention_2"` when loading the model:

```python
- from transformers import Qwen2_5OmniModel
+ from transformers import Qwen2_5OmniForConditionalGeneration

- model = Qwen2_5OmniModel.from_pretrained(
+ model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
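
The hunk is cut off just before the new keyword argument, so for completeness here is a minimal sketch of the full call it describes, reconstructed from the instruction above rather than copied from the diff:

```python
import torch
from transformers import Qwen2_5OmniForConditionalGeneration

# Load with FlashAttention-2 enabled; requires compatible hardware, the
# flash-attn package, and a half-precision dtype such as bfloat16.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```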
 