Abstract
A multifunctional speech synthesis system integrates voice cloning and emotion control using speaker-emotion disentanglement and rotational emotional embeddings, achieving highly expressive and natural speech.
This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controllable speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotional embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements on both objective and subjective metrics. Comprehensive evaluations and analysis show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis.
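As a rough illustration of the two mechanisms named in the abstract, the sketch below shows one plausible reading of them in PyTorch: a planar-rotation blend of speaker and emotion embeddings for smooth intensity control, and an in-batch contrastive loss that pulls same-speaker embeddings together against other speakers in the batch. All function names, tensor shapes, and the specific rotation formulation are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the abstract's two mechanisms; Marco-Voice's
# real implementation is not given here, so everything below is assumed.
import torch
import torch.nn.functional as F


def rotate_emotion_embedding(speaker_emb: torch.Tensor,
                             emotion_emb: torch.Tensor,
                             angle: float) -> torch.Tensor:
    """Blend a speaker embedding with an emotion embedding via a planar
    rotation: angle = 0 keeps the neutral cloned voice, angle = pi/2
    applies the full emotional style (assumed formulation)."""
    theta = torch.tensor(angle)
    return torch.cos(theta) * speaker_emb + torch.sin(theta) * emotion_emb


def in_batch_contrastive_loss(speaker_emb: torch.Tensor,
                              speaker_ids: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """In-batch contrastive objective over (B, D) speaker embeddings:
    pairs sharing a speaker_id are positives, all other utterances in
    the batch serve as negatives."""
    emb = F.normalize(speaker_emb, dim=-1)
    logits = emb @ emb.t() / temperature          # (B, B) cosine similarities
    eye = torch.eye(len(speaker_ids), dtype=torch.bool, device=emb.device)
    logits = logits.masked_fill(eye, float("-inf"))   # exclude self-pairs
    positives = (speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)) & ~eye
    log_prob = F.log_softmax(logits, dim=-1)
    # Average log-probability over each anchor's positive pairs.
    loss = -(log_prob * positives).sum(-1) / positives.sum(-1).clamp(min=1)
    return loss.mean()
```

Under this reading, sweeping `angle` from 0 to π/2 would interpolate smoothly from the neutral cloned voice toward the full emotional style, matching the abstract's claim of smooth emotion control.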
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech (2025)
- Optimizing Multilingual Text-To-Speech with Accents & Emotions (2025)
- TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet (2025)
- SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech (2025)
- Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters (2025)
- EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis (2025)
- Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion (2025)