	---
	license: cc-by-4.0
	datasets:
	- amphion/Emilia-Dataset
	language:
	- ko
	base_model:
	- ResembleAI/chatterbox
	pipeline_tag: text-to-speech
	tags:
	- audio
	- speech
	- tts
	- fine-tuning
	- chatterbox
	- Emilia
	- voice-cloning
	- zero-shot
	- korean
	---

	# Chatterbox TTS Korean 🌸

	Chatterbox TTS Korean is a text-to-speech model fine-tuned from [Chatterbox](https://huggingface.co/ResembleAI/chatterbox) and specialized for the Korean language. It was trained on about 200 hours of Korean audio from the Emilia dataset to produce natural, expressive speech.

	<div align="center"><img width="400px" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Unification_flag_of_Korea.svg/2560px-Unification_flag_of_Korea.svg.png" /></div>

	- 🔊 Language: Korean
	- 🗣️ Training dataset: [Emilia Dataset (KO branch)](https://huggingface.co/datasets/amphion/Emilia-Dataset)
	- ⏱️ Data quantity: 200 hours of audio

	## Usage Example

	Here’s how to generate speech using Chatterbox-TTS Korean:

	```python
	from typing import Optional

	import torch
	import torch.nn as nn
	import soundfile as sf
	from chatterbox.tts import ChatterboxTTS
	from chatterbox.models.tokenizers import EnTokenizer
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file

	# Configuration
	MODEL_REPO = "Thomcles/Chatterbox-TTS-Korean"
	T3_FILENAME = "t3_cfg.safetensors"
	TOKENIZER_FILENAME = "tokenizer_en_ko.json"
	TEXT_VOCAB_SIZE = 4715 + 1  # rows in the text embedding table of this checkpoint
	OUTPUT_PATH = "output_cloned_voice.wav"
	TEXT_TO_SYNTHESIZE = "로마는 하루아침에 이루어진 것이 아니다"

	def get_device() -> str:
	    return "cuda" if torch.cuda.is_available() else "cpu"

	def download_checkpoint(repo: str, filename: str) -> str:
	    # Returns the local cache path of the downloaded file
	    return hf_hub_download(repo_id=repo, filename=filename)

	def load_tts_model(repo: str, checkpoint_file: str, tokenizer_file: str, device: str) -> ChatterboxTTS:
	    # Start from the base Chatterbox weights
	    model = ChatterboxTTS.from_pretrained(device=device)

	    # Swap in the English/Korean tokenizer (assumed to be hosted in the same repo)
	    tokenizer_path = download_checkpoint(repo, tokenizer_file)
	    model.tokenizer = EnTokenizer(tokenizer_path)

	    # Resize the text embedding and output head to the new vocabulary
	    # before loading, so the fine-tuned weights match the layer shapes
	    model.t3.text_emb = nn.Embedding(TEXT_VOCAB_SIZE, model.t3.dim)
	    model.t3.text_head = nn.Linear(model.t3.cfg.hidden_size, TEXT_VOCAB_SIZE, bias=False)

	    # Load the fine-tuned T3 weights, then move the resized layers to the device
	    checkpoint_path = download_checkpoint(repo, checkpoint_file)
	    t3_state = load_file(checkpoint_path, device="cpu")
	    model.t3.load_state_dict(t3_state)
	    model.t3.to(device).eval()

	    return model

	def synthesize_speech(model: ChatterboxTTS, text: str, audio_prompt_path: Optional[str] = None, **kwargs) -> torch.Tensor:
	    with torch.inference_mode():
	        return model.generate(
	            text=text,
	            audio_prompt_path=audio_prompt_path,
	            **kwargs
	        )

	def save_audio(waveform: torch.Tensor, path: str, sample_rate: int) -> None:
	    sf.write(path, waveform.squeeze().cpu().numpy(), sample_rate)

	def main():
	    print("Loading model...")
	    device = get_device()
	    model = load_tts_model(MODEL_REPO, T3_FILENAME, TOKENIZER_FILENAME, device)

	    print(f"Generating speech on {device}...")
	    wav = synthesize_speech(
	        model,
	        TEXT_TO_SYNTHESIZE,
	        audio_prompt_path=None,  # set to a reference audio path to clone a voice
	        exaggeration=0.5,
	        temperature=0.6,
	        cfg_weight=0.3
	    )

	    print(f"Saving output to: {OUTPUT_PATH}")
	    save_audio(wav, OUTPUT_PATH, model.sr)
	    print("Done.")

	if __name__ == "__main__":
	    main()
	```

	Here is the output:

	<audio controls src="https://huggingface.co/Thomcles/Chatterbox-TTS-Korean/resolve/main/example.mp3">Your browser does not support audio.</audio>
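
The usage example passes `audio_prompt_path=None`, so generation relies on the default voice conditioning loaded with the base model. For zero-shot voice cloning, a path to a reference recording can be passed instead. The sketch below is one possible way to sanity-check a reference WAV file before using it. The helper names `describe_reference_clip` and `is_usable_reference`, the placeholder path, and the 3 to 30 second bounds are illustrative assumptions, not values from this model card or from Chatterbox.

```python
import wave

def describe_reference_clip(path: str) -> dict:
    """Read basic properties of a PCM WAV file with the standard library."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        sample_rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": sample_rate,
            "duration_s": frames / sample_rate,
        }

def is_usable_reference(path: str, min_seconds: float = 3.0, max_seconds: float = 30.0) -> bool:
    """Hypothetical check: the duration bounds are illustrative defaults only."""
    info = describe_reference_clip(path)
    return min_seconds <= info["duration_s"] <= max_seconds

# Hypothetical usage with the functions from the usage example:
# REFERENCE_CLIP = "speaker_reference.wav"  # placeholder path
# if is_usable_reference(REFERENCE_CLIP):
#     wav = synthesize_speech(model, TEXT_TO_SYNTHESIZE, audio_prompt_path=REFERENCE_CLIP)
```

The check only reads the WAV header, so it is cheap to run before loading the model. It does not assess recording quality, such as background noise or clipping.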

	### Base model license

	The base model is licensed under the MIT License.
	Base model: [Chatterbox](https://huggingface.co/ResembleAI/chatterbox)
	License: [MIT](https://choosealicense.com/licenses/mit/)

	### Training Data License

	This model was fine-tuned using a dataset licensed under Creative Commons Attribution 4.0 (CC BY 4.0).
	Dataset: [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset)
	License: [Creative Commons Attribution 4.0 International](https://choosealicense.com/licenses/cc-by-4.0/)


	### Contact me

	Interested in fine-tuning a TTS model in a specific language or building a multilingual voice solution? Don’t hesitate to reach out.