[Fine-tuning for other languages] How to obtain or train "qwen-tts-tokenizer" used in Qwen2.5-Omni?
I'm currently exploring the Qwen2.5-Omni multimodal model (as described in the Qwen2.5-Omni Technical Report) and am particularly interested in adapting it to generate speech in languages beyond English and Chinese, specifically Korean.
The Qwen2.5-Omni paper states that the Talker module generates speech using discrete codec tokens produced by an audio tokenizer named "qwen-tts-tokenizer":
> We designed an efficient speech codec named qwen-tts-tokenizer. qwen-tts-tokenizer efficiently represents key information of speech and can be decoded to speech streamingly through a causal audio decoder.
However, the publicly available implementation (e.g., on Hugging Face) appears to include only the decoder part of the pipeline (a usage sketch follows the list below):
- Talker: text → speech codec tokens
- Token2Wav: speech codec tokens → mel-spectrogram → waveform
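For reference, here is what that decoder-side path looks like when called through transformers: a minimal sketch, assuming the class names and `generate()` interface shown on the Hugging Face model card (`Qwen2_5OmniForConditionalGeneration`, `Qwen2_5OmniProcessor`, `return_audio`), which may differ across transformers versions:

```python
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

# Class names follow the current model card / transformers release;
# earlier snapshots exposed the model as Qwen2_5OmniModel instead.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# The model card additionally recommends a specific system prompt to enable speech output.
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Say hello."}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, return_tensors="pt", padding=True).to(model.device)

# generate() returns text ids plus the synthesized waveform. The codec tokens
# emitted by Talker are consumed internally by Token2Wav and never surfaced,
# and no waveform -> codec tokens entry point is exposed anywhere.
text_ids, audio = model.generate(**inputs, return_audio=True)
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```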
I couldn't find any publicly accessible method or code snippet for the reverse step (waveform → codec tokens) using the mentioned "qwen-tts-tokenizer".
This leads to a few questions:
Is the "qwen-tts-tokenizer" publicly available or open-sourced?
If yes, could you point me to the repository or implementation?If not publicly available, are there plans to release the tokenizer or its pretrained checkpoints?
Alternatively, could you share what kind of model architecture or codec method (e.g., EnCodec, SoundStream, VQ-VAE) the tokenizer uses?
This information would be extremely helpful to recreate or fine-tune a similar tokenizer.
Having access to this tokenizer is critical for training the Talker module on new languages (like Korean); without it, the necessary training data (text → codec token pairs) cannot be created.
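To make the missing step concrete, here is what waveform → codec tokens looks like with an open codec (EnCodec via transformers' `EncodecModel`). This is only a stand-in for illustration, not the actual qwen-tts-tokenizer, and the resulting codes would of course not be compatible with the Talker:

```python
import numpy as np
import torch
from transformers import AutoProcessor, EncodecModel

# Stand-in open codec; NOT qwen-tts-tokenizer. Shown only to illustrate the
# waveform -> codec tokens interface that is currently missing.
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# One second of dummy 24 kHz audio; in practice this would be a Korean speech clip.
sr = processor.sampling_rate
waveform = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr).astype(np.float32)

inputs = processor(raw_audio=waveform, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])

codec_tokens = encoder_outputs.audio_codes  # discrete codebook indices
print(codec_tokens.shape)
```

With an equivalent encoder for qwen-tts-tokenizer, the same pattern would yield the (text, codec token) pairs needed to train the Talker on Korean data.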
I appreciate any guidance, details, or alternative recommendations you can provide.
Thank you!
cc @littlebird13, @xiongwang, @Jin-xu