---
license: gpl-3.0
pipeline_tag: any-to-any
tags:
- omni
---
# Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng*
For an introduction to Stream-Omni and usage instructions, refer to https://github.com/ictnlp/Stream-Omni; a minimal checkpoint-download sketch follows the feature list below.
Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features💡:
- **Omni Interaction**: Supports any combination of multimodal inputs, including text, vision, and speech, and generates both text and speech responses.
- **Seamless "see-while-hear" Experience**: Simultaneously outputs intermediate textual results (e.g., ASR transcriptions and model responses) during speech interaction, like the advanced voice mode of GPT-4o.
- **Efficient Training**: Requires only a small amount of omni-modal data for training.
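
The inference scripts themselves live in the GitHub repository rather than in a standard `transformers` pipeline, so the sketch below only fetches the checkpoint with `huggingface_hub`; the repo id `ICTNLP/stream-omni-8b` is an assumption and should be replaced with this model card's actual id.

```python
# Minimal sketch: download the Stream-Omni checkpoint for use with the
# inference scripts from https://github.com/ictnlp/Stream-Omni.
# NOTE: the repo id below is an assumption; substitute the id shown on
# this model card if it differs.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ICTNLP/stream-omni-8b",  # assumed repo id
    local_dir="./stream-omni-8b",     # where to store the weights
)
print(f"Checkpoint downloaded to: {local_dir}")
```

After downloading, point the repository's inference scripts at the local checkpoint directory.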
## 🖥 Demo
*(Demo videos: Microphone Input and File Input.)*
Stream-Omni can produce intermediate textual results (ASR transcription and text response) during speech interaction, offering users a seamless "see-while-hear" experience.