---
license: cc-by-4.0
datasets:
- amphion/Emilia-Dataset
language:
- ko
base_model:
- ResembleAI/chatterbox
pipeline_tag: text-to-speech
tags:
- audio
- speech
- tts
- fine-tuning
- chatterbox
- Emilia
- voice-cloning
- zero-shot
- korean
---

# Chatterbox TTS Korean 🌸

**Chatterbox TTS Korean** is a fine-tuned text-to-speech model specialized for the Korean language. It was trained on high-quality voice data for natural and expressive speech synthesis.

<div align="center"><img width="400px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Unification_flag_of_Korea.svg/2560px-Unification_flag_of_Korea.svg.png" /></div>

- 🔊 **Language**: Korean
- 🗣️ **Training dataset**: [Emilia Dataset (KO branch)](https://huggingface.co/datasets/amphion/Emilia-Dataset)  
- ⏱️ **Data quantity**: 200 hours of audio  

## Usage Example

Here’s how to generate speech using Chatterbox TTS Korean (the example assumes the `chatterbox-tts`, `soundfile`, `safetensors`, and `huggingface_hub` packages are installed):

```python
import torch
import soundfile as sf
from chatterbox.tts import ChatterboxTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Configuration
MODEL_REPO = "Thomcles/Chatterbox-TTS-Korean"
T3_FILENAME = "t3_cfg.safetensors"
TOKENIZER_FILENAME = "tokenizer_en_ko.json"
OUTPUT_PATH = "output_cloned_voice.wav"
TEXT_TO_SYNTHESIZE = "로마는 하루아침에 이루어진 것이 아니다"

def get_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"

def download_checkpoint(repo: str, filename: str) -> str:
    return hf_hub_download(repo_id=repo, filename=filename)

def load_tts_model(repo: str, checkpoint_file: str, tokenizer_file: str, device: str) -> ChatterboxTTS:
    # Start from the base Chatterbox model, then swap in the fine-tuned Korean weights.
    model = ChatterboxTTS.from_pretrained(device=device)

    # The fine-tuned checkpoint uses an extended text vocabulary (4715 + 1 entries),
    # so resize the text embedding and output head before loading the state dict.
    model.t3.text_emb = nn.Embedding(4715 + 1, model.t3.dim)
    model.t3.text_head = nn.Linear(model.t3.cfg.hidden_size, 4715 + 1, bias=False)

    checkpoint_path = download_checkpoint(repo, checkpoint_file)
    t3_state = load_file(checkpoint_path, device="cpu")
    model.t3.load_state_dict(t3_state)
    model.t3.to(device)

    # The Korean tokenizer file ships in the same repository.
    tokenizer_path = download_checkpoint(repo, tokenizer_file)
    model.tokenizer = EnTokenizer(tokenizer_path)

    return model

def synthesize_speech(model: ChatterboxTTS, text: str, audio_prompt_path:str, **kwargs) -> torch.Tensor:
    with torch.inference_mode():
        return model.generate(
            text=text, 
            audio_prompt_path=audio_prompt_path, 
            **kwargs
        )

def save_audio(waveform: torch.Tensor, path: str, sample_rate: int):
    sf.write(path, waveform.squeeze().cpu().numpy(), sample_rate)

def main():
    print("Loading model...")
    device = get_device()
    model = load_tts_model(MODEL_REPO, T3_FILENAME, TOKENIZER_FILENAME, device)

    print(f"Generating speech on {device}...")
    wav = synthesize_speech(
        model,
        TEXT_TO_SYNTHESIZE,
        audio_prompt_path=None,  # pass a path to a reference wav to clone that speaker's voice
        exaggeration=0.5,
        temperature=0.6,
        cfg_weight=0.3,
    )

    print(f"Saving output to: {OUTPUT_PATH}")
    save_audio(wav, OUTPUT_PATH, model.sr)
    print("Done.")

if __name__ == "__main__":
    main()
```

Here is the output:

<audio controls src="https://huggingface.co/Thomcles/Chatterbox-TTS-Korean/resolve/main/example.mp3">Your browser does not support audio.</audio>
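
Because Chatterbox supports zero-shot voice cloning, generation can also be conditioned on a reference recording instead of the built-in default voice. The sketch below reuses the helpers defined above; `reference_speaker.wav` is a placeholder path for a short, clean recording of the target speaker.

```python
# Voice-cloning sketch reusing the helpers from the example above.
# "reference_speaker.wav" is a placeholder; replace it with a short, clean
# recording of the voice you want to clone.
device = get_device()
model = load_tts_model(MODEL_REPO, T3_FILENAME, TOKENIZER_FILENAME, device)

wav = synthesize_speech(
    model,
    TEXT_TO_SYNTHESIZE,
    audio_prompt_path="reference_speaker.wav",
    exaggeration=0.5,
    temperature=0.6,
    cfg_weight=0.3,
)
save_audio(wav, "cloned_output.wav", model.sr)
```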

### Base Model License

The base model is licensed under the MIT License.  
Base model: [Chatterbox](https://huggingface.co/ResembleAI/chatterbox)  
License: [MIT](https://choosealicense.com/licenses/mit/)  

### Training Data License

This model was fine-tuned using a dataset licensed under Creative Commons Attribution 4.0 (CC BY 4.0).  
Dataset: [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset)  
License: [Creative Commons Attribution 4.0 International](https://choosealicense.com/licenses/cc-by-4.0/)  


### Contact me

Interested in fine-tuning a TTS model in a specific language or building a multilingual voice solution? Don’t hesitate to reach out.