---
license: cc-by-4.0
datasets:
- amphion/Emilia-Dataset
language:
- ko
base_model:
- ResembleAI/chatterbox
pipeline_tag: text-to-speech
tags:
- audio
- speech
- tts
- fine-tuning
- chatterbox
- Emilia
- voice-cloning
- zero-shot
- korean
---

# Chatterbox TTS Korean 🌸

**Chatterbox TTS Korean** is a fine-tuned text-to-speech model specialized for the Korean language. It was trained on high-quality voice data for natural, expressive speech synthesis.

- 🔊 **Language**: Korean
- 🗣️ **Training dataset**: [Emilia Dataset (KO branch)](https://huggingface.co/datasets/amphion/Emilia-Dataset)
- ⏱️ **Data quantity**: 200 hours of audio

## Usage Example

Here's how to generate speech using Chatterbox-TTS Korean:

```python
from typing import Optional

import soundfile as sf
import torch
import torch.nn as nn
from chatterbox.models.tokenizers import EnTokenizer
from chatterbox.tts import ChatterboxTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Configuration
MODEL_REPO = "Thomcles/Chatterbox-TTS-Korean"
T3_FILENAME = "t3_cfg.safetensors"
TOKENIZER_FILENAME = "tokenizer_en_ko.json"
TEXT_VOCAB_SIZE = 4715 + 1  # text vocabulary size expected by the fine-tuned T3 weights
OUTPUT_PATH = "output_cloned_voice.wav"
TEXT_TO_SYNTHESIZE = "로마는 하루아침에 이루어진 것이 아니다"


def get_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"


def download_checkpoint(repo: str, filename: str) -> str:
    """Download a file from the Hub and return its local path."""
    return hf_hub_download(repo_id=repo, filename=filename)


def load_tts_model(repo: str, checkpoint_file: str, tokenizer_file: str, device: str) -> ChatterboxTTS:
    model = ChatterboxTTS.from_pretrained(device=device)

    # Swap in the English/Korean tokenizer (assumed to be hosted in the same repo as the checkpoint).
    model.tokenizer = EnTokenizer(download_checkpoint(repo, tokenizer_file))

    # Resize the text embedding and output head BEFORE loading the weights,
    # so the fine-tuned parameters are not replaced by freshly initialized layers.
    model.t3.text_emb = nn.Embedding(TEXT_VOCAB_SIZE, model.t3.dim)
    model.t3.text_head = nn.Linear(model.t3.cfg.hidden_size, TEXT_VOCAB_SIZE, bias=False)

    t3_state = load_file(download_checkpoint(repo, checkpoint_file), device="cpu")
    model.t3.load_state_dict(t3_state)
    model.t3.to(device).eval()  # move the newly created layers to the target device
    return model


def synthesize_speech(model: ChatterboxTTS, text: str, audio_prompt_path: Optional[str] = None, **kwargs) -> torch.Tensor:
    with torch.inference_mode():
        return model.generate(text=text, audio_prompt_path=audio_prompt_path, **kwargs)


def save_audio(waveform: torch.Tensor, path: str, sample_rate: int) -> None:
    sf.write(path, waveform.squeeze().cpu().numpy(), sample_rate)


def main() -> None:
    print("Loading model...")
    device = get_device()
    model = load_tts_model(MODEL_REPO, T3_FILENAME, TOKENIZER_FILENAME, device)

    print(f"Generating speech on {device}...")
    wav = synthesize_speech(
        model,
        TEXT_TO_SYNTHESIZE,
        audio_prompt_path=None,  # set to a reference .wav path to clone a voice
        exaggeration=0.5,
        temperature=0.6,
        cfg_weight=0.3,
    )

    print(f"Saving output to: {OUTPUT_PATH}")
    save_audio(wav, OUTPUT_PATH, model.sr)
    print("Done.")


if __name__ == "__main__":
    main()
```

### Base model license

The base model is licensed under the MIT License.

- Base model: [Chatterbox](https://huggingface.co/ResembleAI/chatterbox)
- License: [MIT](https://choosealicense.com/licenses/mit/)

### Training Data License

This model was fine-tuned using a dataset licensed under Creative Commons Attribution 4.0 (CC BY 4.0).

- Dataset: [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset)
- License: [Creative Commons Attribution 4.0 International](https://choosealicense.com/licenses/cc-by-4.0/)

### Contact me

Interested in fine-tuning a TTS model in a specific language or building a multilingual voice solution? Don't hesitate to reach out.