---
library_name: transformers
license: apache-2.0
model-index:
- name: umt5-thai-g2p-9
results:
- task:
type: text2text-generation
name: Grapheme-to-Phoneme Conversion
dataset:
name: B-K/thai-g2p
type: B-K/thai-g2p
config: default
split: sentence_validation
metrics:
- type: cer
value: 0.094
name: Character Error Rate
- type: loss
value: 1.5449
name: Loss
datasets:
- B-K/thai-g2p
language:
- th
metrics:
- cer
pipeline_tag: text2text-generation
widget:
- text: สวัสดีครับ
example_title: Thai G2P Example
new_version: B-K/umt5-thai-g2p-v2-0.5k
---
# umt5-thai-g2p
This model is a fine-tuned version of [google/umt5-small](https://huggingface.co/google/umt5-small) on the [B-K/thai-g2p](https://huggingface.co/datasets/B-K/thai-g2p) dataset for Thai Grapheme-to-Phoneme (G2P) conversion.
It achieves the following results on the sentence-level validation split (`sentence_validation`):
- Loss: 1.5449
- CER: 0.094
## Model Description
`umt5-thai-g2p` is designed to convert Thai text (words or sentences) into its corresponding phonemic International Phonetic Alphabet (IPA) representation.
## Intended uses & limitations
### Intended Uses
* **Thai Grapheme-to-Phoneme (G2P) Conversion**: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
* **Speech Synthesis Preprocessing**: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing.
### Limitations
* **Accuracy**: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes.
* **Out-of-Distribution Data**: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the `B-K/thai-g2p` training dataset. This includes very rare words, neologisms, or complex named entities.
* **Ambiguity**: Thai orthography can sometimes be ambiguous, and the model may not always resolve such ambiguities to the intended pronunciation.
* **Sentence-Level vs. Word-Level**: While trained on a dataset that includes sentences, its robustness for very long or highly complex sentences might vary. The average generated length observed during training was around 27 tokens.
* **Inherited Limitations**: As a fine-tuned version of `google/umt5-small`, it inherits the general architectural limitations and scale of the base model.
## How to use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned G2P model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ"  # Example Thai text

# Tokenize the input and generate the IPA phoneme sequence with beam search
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")
```
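Multiple sentences can be transcribed in a single call by letting the tokenizer pad a batch. The sketch below reuses the `tokenizer` and `model` loaded above; the example sentences are arbitrary illustrations, not entries from the dataset.
```python
# Batched inference sketch (example sentences are illustrative, not from the dataset)
sentences = ["สวัสดีครับ", "ขอบคุณมากครับ"]
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
batch_outputs = model.generate(**batch, num_beams=3, max_new_tokens=48)
for text, ids in zip(sentences, batch_outputs):
    print(text, "->", tokenizer.decode(ids, skip_special_tokens=True))
```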
## Training procedure
### Training Hyperparameters
The following hyperparameters were used during training (a hedged `Seq2SeqTrainingArguments` sketch follows the list):
* optimizer: adamw_torch
* learning_rate: 5e-4, decayed to roughly 5e-6 by the end of training
* lr_scheduler_type: cosine
* num_train_epochs: approximately 200 in total (the training settings were retuned several times across runs)
* per_device_train_batch_size: 128
* per_device_eval_batch_size: 128
* weight_decay: 0.01 initially, increased to 0.1 in later runs
* label_smoothing_factor: 0.1
* max_grad_norm: 1.0
* warmup_steps: 100
* mixed_precision: bf16
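The snippet below is a minimal sketch of how these settings map onto `Seq2SeqTrainingArguments`. Because the values above changed between runs, treat it as an approximation rather than a verified reproduction script; `output_dir` is a placeholder.
```python
from transformers import Seq2SeqTrainingArguments

# Approximate training configuration (values follow the list above; some changed between runs)
training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",        # placeholder output directory
    optim="adamw_torch",
    learning_rate=5e-4,                # decayed toward ~5e-6 over training
    lr_scheduler_type="cosine",
    num_train_epochs=200,              # approximate total across retuned runs
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,                 # later runs used up to 0.1
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    predict_with_generate=True,        # report CER on generated text during evaluation
    eval_strategy="epoch",
)
```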
### Training results
| Training Loss | Epoch | Step | Validation Loss | Cer | Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:------:|:-------:|
| No log | 1.0 | 134 | 1.5636 | 0.0917 | 27.1747 |
| No log | 2.0 | 268 | 1.5603 | 0.093 | 27.1781 |
| No log | 3.0 | 402 | 1.5566 | 0.0938 | 27.1729 |
| 1.1631 | 4.0 | 536 | 1.5524 | 0.0941 | 27.1678 |
| 1.1631 | 5.0 | 670 | 1.5508 | 0.0939 | 27.113 |
| 1.1631 | 6.0 | 804 | 1.5472 | 0.0932 | 27.1575 |
| 1.1631 | 7.0 | 938 | 1.5450 | 0.0933 | 27.1421 |
| 1.1603 | 8.0 | 1072 | 1.5449 | 0.094 | 27.0616 |
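The CER values above compare generated phoneme strings against the reference transcriptions character by character. Below is a minimal sketch of how such a score can be computed with the `evaluate` library; the strings are hypothetical placeholders rather than actual dataset entries, and this is not necessarily the exact evaluation script used for this card.
```python
import evaluate

# Character Error Rate between generated and reference phoneme strings
cer_metric = evaluate.load("cer")
predictions = ["s a . w a t . d iː"]          # hypothetical model output
references = ["s a . w a t . d iː k r a p"]   # hypothetical gold transcription
print(cer_metric.compute(predictions=predictions, references=references))  # prints the CER
```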
### Framework versions
- Transformers 4.47.0
- Pytorch 2.5.1
- Datasets 3.6.0
- Tokenizers 0.21.0