---
license: cc-by-4.0
datasets:
- amphion/Emilia-Dataset
language:
- ko
base_model:
- ResembleAI/chatterbox
pipeline_tag: text-to-speech
tags:
- audio
- speech
- tts
- fine-tuning
- chatterbox
- Emilia
- voice-cloning
- zero-shot
- korean
---

# Chatterbox TTS Korean 🌸

**Chatterbox TTS Korean** is a fine-tuned text-to-speech model specialized for the Korean language. It was fine-tuned from Chatterbox on high-quality Korean voice data for natural and expressive speech synthesis.

<div align="center"><img width="400px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Unification_flag_of_Korea.svg/2560px-Unification_flag_of_Korea.svg.png" /></div>

- 🔊 **Language**: Korean
- 🗣️ **Training dataset**: [Emilia Dataset (KO branch)](https://huggingface.co/datasets/amphion/Emilia-Dataset)
- ⏱️ **Data quantity**: 200 hours of audio

## Usage Example

Here’s how to generate speech using Chatterbox TTS Korean:

```python
import soundfile as sf
import torch
import torch.nn as nn
from chatterbox.models.tokenizers import EnTokenizer
from chatterbox.tts import ChatterboxTTS
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Configuration
MODEL_REPO = "Thomcles/Chatterbox-TTS-Korean"
T3_FILENAME = "t3_cfg.safetensors"
TOKENIZER_FILENAME = "tokenizer_en_ko.json"
OUTPUT_PATH = "output_cloned_voice.wav"
TEXT_TO_SYNTHESIZE = "로마는 하루아침에 이루어진 것이 아니다"  # "Rome was not built in a day"
TEXT_VOCAB_SIZE = 4715 + 1  # size of the text embedding and head in the fine-tuned T3


def get_device() -> str:
    return "cuda" if torch.cuda.is_available() else "cpu"


def download_checkpoint(repo: str, filename: str) -> str:
    return hf_hub_download(repo_id=repo, filename=filename)


def load_tts_model(repo: str, checkpoint_file: str, tokenizer_file: str, device: str) -> ChatterboxTTS:
    model = ChatterboxTTS.from_pretrained(device=device)

    # Swap in the tokenizer (assumed to be hosted in the same repo as the checkpoint).
    model.tokenizer = EnTokenizer(download_checkpoint(repo, tokenizer_file))

    # Resize the text layers before loading, so their shapes match the checkpoint.
    model.t3.text_emb = nn.Embedding(TEXT_VOCAB_SIZE, model.t3.dim)
    model.t3.text_head = nn.Linear(model.t3.cfg.hidden_size, TEXT_VOCAB_SIZE, bias=False)

    checkpoint_path = download_checkpoint(repo, checkpoint_file)
    t3_state = load_file(checkpoint_path, device="cpu")
    model.t3.load_state_dict(t3_state)
    model.t3.to(device).eval()  # the new layers are created on CPU

    return model


def synthesize_speech(model: ChatterboxTTS, text: str, audio_prompt_path: str | None, **kwargs) -> torch.Tensor:
    with torch.inference_mode():
        return model.generate(
            text=text,
            audio_prompt_path=audio_prompt_path,
            **kwargs,
        )


def save_audio(waveform: torch.Tensor, path: str, sample_rate: int) -> None:
    sf.write(path, waveform.squeeze().cpu().numpy(), sample_rate)


def main() -> None:
    print("Loading model...")
    device = get_device()
    model = load_tts_model(MODEL_REPO, T3_FILENAME, TOKENIZER_FILENAME, device)

    print(f"Generating speech on {device}...")
    wav = synthesize_speech(
        model,
        TEXT_TO_SYNTHESIZE,
        audio_prompt_path=None,  # set to a reference clip path to clone a voice
        exaggeration=0.5,
        temperature=0.6,
        cfg_weight=0.3,
    )

    print(f"Saving output to: {OUTPUT_PATH}")
    save_audio(wav, OUTPUT_PATH, model.sr)
    print("Done.")


if __name__ == "__main__":
    main()
```
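
The loader sets `text_emb` and `text_head` to 4715+1 entries, which suggests the fine-tuned checkpoint stores tensors of that size. `load_state_dict` is strict about shapes, so the layers have to match the checkpoint before the weights are loaded, and replacing a layer after loading would discard its loaded weights. Here is a minimal sketch of that ordering on a toy module. `TinyT3`, `resize_text_layers`, and the sizes used are hypothetical stand-ins, not the actual Chatterbox classes or values.

```python
import torch
import torch.nn as nn


class TinyT3(nn.Module):
    """Hypothetical stand-in for a model with a text embedding and a text head."""

    def __init__(self, vocab_size: int, hidden_size: int = 8):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, hidden_size)
        self.text_head = nn.Linear(hidden_size, vocab_size, bias=False)


def resize_text_layers(model: TinyT3, new_vocab_size: int) -> None:
    """Replace the vocabulary-sized layers so they match a larger checkpoint."""
    hidden_size = model.text_emb.embedding_dim
    model.text_emb = nn.Embedding(new_vocab_size, hidden_size)
    model.text_head = nn.Linear(hidden_size, new_vocab_size, bias=False)


# Pretend checkpoint saved from a model with an extended vocabulary (toy sizes).
finetuned_state = TinyT3(vocab_size=20).state_dict()

# A freshly built base model has the smaller, original vocabulary.
base = TinyT3(vocab_size=10)

# Loading directly fails: strict shape checking rejects the mismatched tensors.
try:
    base.load_state_dict(finetuned_state)
    loaded_without_resize = True
except RuntimeError:
    loaded_without_resize = False

# Resizing first makes the shapes line up, so the checkpoint loads cleanly.
resize_text_layers(base, new_vocab_size=20)
base.load_state_dict(finetuned_state)

weights_match = torch.equal(base.text_emb.weight, finetuned_state["text_emb.weight"])
print(loaded_without_resize, weights_match)
```

The same reasoning applies to the real model: the resized layers start out randomly initialized, and their values only become meaningful once the checkpoint has been loaded into them.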

Here is the output:

<audio controls src="https://huggingface.co/Thomcles/Chatterbox-TTS-Korean/resolve/main/example.mp3">Your browser does not support audio.</audio>
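
The usage example passes `audio_prompt_path=None`, meaning no reference recording is supplied. For voice cloning, `audio_prompt_path` takes the path to a reference clip instead. Below is a small sketch of a check that could run before synthesis, so a bad path fails early rather than in the middle of generation. `resolve_audio_prompt` and the list of accepted extensions are hypothetical and not part of Chatterbox.

```python
from pathlib import Path
from typing import Optional

# Hypothetical allow-list; adjust to whatever your audio loader supports.
ALLOWED_SUFFIXES = {".wav", ".flac", ".mp3"}


def resolve_audio_prompt(path: Optional[str]) -> Optional[str]:
    """Return a checked path for `audio_prompt_path`, or None when no clip is given."""
    if path is None:
        return None
    clip = Path(path).expanduser()
    if clip.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"Unsupported audio format: {clip.suffix!r}")
    if not clip.is_file():
        raise FileNotFoundError(f"Reference clip not found: {clip}")
    return str(clip)
```

The returned value can be passed as `audio_prompt_path` to `synthesize_speech` from the usage example.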

### Base model license

The base model is licensed under the MIT License.

- Base model: [Chatterbox](https://huggingface.co/ResembleAI/chatterbox)
- License: [MIT](https://choosealicense.com/licenses/mit/)

### Training Data License

This model was fine-tuned using a dataset licensed under Creative Commons Attribution 4.0 (CC BY 4.0).

- Dataset: [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset)
- License: [Creative Commons Attribution 4.0 International](https://choosealicense.com/licenses/cc-by-4.0/)

### Contact me

Interested in fine-tuning a TTS model in a specific language or building a multilingual voice solution? Don’t hesitate to reach out.