kNN-TTS
While recent zero-shot multi-speaker text-to-speech (TTS) models achieve impressive results, they typically rely on extensive transcribed speech datasets from numerous speakers and intricate training pipelines. Meanwhile, self-supervised learning (SSL) speech features have emerged as effective intermediate representations for TTS. Further, SSL features from different speakers that are linearly close share phonetic information while maintaining individual speaker identity. In this study, we introduce kNN-TTS, a simple and effective framework for zero-shot multi-speaker TTS using retrieval methods which leverage the linear relationships between SSL features. Objective and subjective evaluations show that our models, trained on transcribed speech from a single speaker only, achieve performance comparable to state-of-the-art models that are trained on significantly larger training datasets. The low training data requirements mean that kNN-TTS is well suited for the development of multi-speaker TTS systems for low-resource domains and languages. We also introduce an interpolation parameter which enables fine-grained voice morphing. Demo samples are available at https://idiap.github.io/knn-tts.
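The retrieval and voice-morphing idea described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the project's implementation: the function names (`cosine_similarity`, `knn_convert`), the use of cosine similarity, the default `k=4`, and the linear blend controlled by `lam` are all assumptions made for this example. Real SSL features are high-dimensional frames extracted by a pretrained speech model; here they are stood in for by short lists of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_convert(source_frames, target_bank, k=4, lam=1.0):
    """Replace each source frame with a blend of itself and its neighbours.

    source_frames: frames predicted from text (source-speaker features).
    target_bank:   frames extracted from the target speaker's audio.
    k:             number of nearest target frames to average (assumed value).
    lam:           hypothetical interpolation weight; 0.0 keeps the source
                   frame unchanged, 1.0 uses only the retrieved target frames.
    """
    converted = []
    for frame in source_frames:
        # Rank target frames by similarity to the current source frame.
        neighbours = sorted(
            target_bank,
            key=lambda t: cosine_similarity(frame, t),
            reverse=True,
        )[:k]
        # Average the k nearest target frames, dimension by dimension.
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        # Linearly interpolate between the source frame and the retrieved mean.
        converted.append(
            [lam * m + (1.0 - lam) * s for m, s in zip(mean, frame)]
        )
    return converted


if __name__ == "__main__":
    source = [[1.0, 0.0]]
    bank = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
    for lam in (0.0, 0.5, 1.0):
        print(lam, knn_convert(source, bank, k=2, lam=lam))
```

In a full pipeline the converted frames would then be passed to a vocoder to produce audio. Intermediate values of `lam` give outputs that sit between the source and target voices, which is the kind of morphing the interpolation parameter is meant to enable.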
Overview
- Training: kNN-TTS was trained on the LJ Speech Dataset
- Parameters: 51.5 M
- Task: Zero-shot Multi-speaker TTS
- Output structure: audio
- Performance: See paper https://arxiv.org/abs/2408.10771 for details
Running kNN-TTS
Please see the project GitHub repository for instructions on running the model.
License
The MIT License (MIT)
Copyright © 2025 Idiap Research Institute
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Citation
If you find our work useful, please cite the following publication:
@misc{hajal2025knntts,
  title={kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech},
  author={Karl El Hajal and Ajinkya Kulkarni and Enno Hermann and Mathew Magimai.-Doss},
  year={2025},
  eprint={2408.10771},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2408.10771},
}