---
library_name: transformers
license: apache-2.0
model-index:
- name: umt5-thai-g2p-9
results:
- task:
type: text2text-generation
name: Grapheme-to-Phoneme Conversion
dataset:
name: B-K/thai-g2p
type: B-K/thai-g2p
config: default
split: sentence_validation
metrics:
- type: cer
value: 0.094
name: Character Error Rate
- type: loss
value: 1.5449
name: Loss
datasets:
- B-K/thai-g2p
language:
- th
metrics:
- cer
pipeline_tag: text2text-generation
widget:
- text: สวัสดีครับ
example_title: Thai G2P Example
new_version: B-K/umt5-thai-g2p-v2-0.5k
---
# umt5-thai-g2p
This model is a fine-tuned version of [google/umt5-small](https://huggingface.co/google/umt5-small) on the [B-K/thai-g2p](https://huggingface.co/datasets/B-K/thai-g2p) dataset for Thai Grapheme-to-Phoneme (G2P) conversion.
It achieves the following results on the sentence-level validation split (`sentence_validation`):
- Loss: 1.5449
- CER: 0.094
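
The reported CER is a character error rate computed over the predicted IPA strings. The original evaluation script is not included in this card, but a minimal sketch of an equivalent computation using the Hugging Face `evaluate` library (an assumption about tooling, not the author's script) looks like this:

```python
import evaluate

# Character error rate metric (requires the `evaluate` and `jiwer` packages)
cer_metric = evaluate.load("cer")

# Illustrative placeholder strings only; not real examples from B-K/thai-g2p
predictions = ["k aː n"]
references = ["k aː"]

# CER = character-level edit distance divided by reference length
print(cer_metric.compute(predictions=predictions, references=references))
```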
## Model Description
`umt5-thai-g2p` is designed to convert Thai text (words or sentences) into their corresponding phonemic International Phonetic Alphabet (IPA) representations.
## Intended uses & limitations
### Intended Uses
* **Thai Grapheme-to-Phoneme (G2P) Conversion**: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
* **Speech Synthesis Preprocessing**: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing.
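
For quick experimentation, for example as the G2P front-end of a TTS pipeline, the model can also be called through the `text2text-generation` pipeline. A minimal sketch, with generation settings chosen for illustration:

```python
from transformers import pipeline

# Thai text in, phonemic (IPA) string out
g2p = pipeline("text2text-generation", model="B-K/umt5-thai-g2p")

result = g2p("สวัสดีครับ", num_beams=3, max_new_tokens=48)
print(result[0]["generated_text"])
```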
### Limitations
* **Accuracy**: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes.
* **Out-of-Distribution Data**: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the `B-K/thai-g2p` training dataset. This includes very rare words, neologisms, or complex named entities.
* **Ambiguity**: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities correctly to the intended pronunciation in all contexts.
* **Sentence-Level vs. Word-Level**: While trained on a dataset that includes sentences, its robustness for very long or highly complex sentences might vary. The average generated length observed during training was around 27 tokens.
* **Inherited Limitations**: As a fine-tuned version of `google/umt5-small`, it inherits the general architectural limitations and scale of the base model.
## How to use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ"  # Example Thai text

# Tokenize, then generate the phonemic (IPA) transcription with beam search
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")
```
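
To transcribe several sentences in one forward pass, the same model can be run on a padded batch. A minimal sketch, assuming the batch fits comfortably in memory:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

# Example batch of Thai inputs
texts = ["สวัสดีครับ", "ขอบคุณมาก"]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)

# Decode every sequence in the batch
phonemes = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for text, phoneme in zip(texts, phonemes):
    print(text, "->", phoneme)
```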
## Training procedure
### Training Hyperparameters
The following hyperparameters were used during training:
* optimizer: adamw_torch
* learning_rate: 5e-4, decayed to 5e-6 by the end of training
* lr_scheduler_type: cosine
* num_train_epochs: approximately 200 in total (training settings were adjusted across multiple runs)
* per_device_train_batch_size: 128
* per_device_eval_batch_size: 128
* weight_decay: 0.01 initially, increased to 0.1 over the course of training
* label_smoothing_factor: 0.1
* max_grad_norm: 1.0
* warmup_steps: 100
* mixed_precision: bf16
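
A rough reconstruction of these settings as `Seq2SeqTrainingArguments` is sketched below. Because the exact values varied across runs (as noted above), treat this as an approximation rather than the original training configuration:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate configuration based on the hyperparameters listed above;
# learning rate, weight decay, and epoch count were changed between runs.
training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",
    optim="adamw_torch",
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=200,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    predict_with_generate=True,
)
```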
### Training results
| Training Loss | Epoch | Step | Validation Loss | Cer | Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:------:|:-------:|
| No log | 1.0 | 134 | 1.5636 | 0.0917 | 27.1747 |
| No log | 2.0 | 268 | 1.5603 | 0.093 | 27.1781 |
| No log | 3.0 | 402 | 1.5566 | 0.0938 | 27.1729 |
| 1.1631 | 4.0 | 536 | 1.5524 | 0.0941 | 27.1678 |
| 1.1631 | 5.0 | 670 | 1.5508 | 0.0939 | 27.113 |
| 1.1631 | 6.0 | 804 | 1.5472 | 0.0932 | 27.1575 |
| 1.1631 | 7.0 | 938 | 1.5450 | 0.0933 | 27.1421 |
| 1.1603 | 8.0 | 1072 | 1.5449 | 0.094 | 27.0616 |
### Framework versions
- Transformers 4.47.0
- Pytorch 2.5.1
- Datasets 3.6.0
- Tokenizers 0.21.0