---
library_name: transformers
license: apache-2.0
model-index:
- name: umt5-thai-g2p-9
  results:
  - task:
      type: text2text-generation
      name: Grapheme-to-Phoneme Conversion
    dataset:
      name: B-K/thai-g2p
      type: B-K/thai-g2p
      config: default
      split: sentence_validation
    metrics:
    - type: cer
      value: 0.094
      name: Character Error Rate
    - type: loss
      value: 1.5449
      name: Loss
datasets:
- B-K/thai-g2p
language:
- th
metrics:
- cer
pipeline_tag: text2text-generation
widget:
- text: สวัสดีครับ
  example_title: Thai G2P Example
new_version: B-K/umt5-thai-g2p-v2-0.5k
---

# umt5-thai-g2p

This model is a fine-tuned version of [google/umt5-small](https://huggingface.co/google/umt5-small) on the [B-K/thai-g2p](https://huggingface.co/datasets/B-K/thai-g2p) dataset for Thai Grapheme-to-Phoneme (G2P) conversion.

It achieves the following results on the sentence evaluation set:
- Loss: 1.5449
- CER: 0.094

## Model Description

`umt5-thai-g2p` is designed to convert Thai text (words or sentences) into the corresponding phonemic International Phonetic Alphabet (IPA) representation.

## Intended uses & limitations

### Intended Uses

* **Thai Grapheme-to-Phoneme (G2P) Conversion**: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
* **Speech Synthesis Preprocessing**: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing, as in the sketch below.
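
As a rough illustration of that preprocessing role, the sketch below wraps the model in a small helper whose output string would be handed to whatever acoustic model the surrounding TTS system uses; `synthesize_from_phonemes` is a hypothetical placeholder, not part of this repository.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the G2P front end once at pipeline start-up.
tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

def thai_to_phonemes(text: str) -> str:
    """Convert Thai text to an IPA phoneme string."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=3, max_new_tokens=48)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

phonemes = thai_to_phonemes("สวัสดีครับ")
# Hand the phoneme string to the acoustic model of your TTS stack;
# `synthesize_from_phonemes` is a hypothetical placeholder.
# waveform = synthesize_from_phonemes(phonemes)
```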

### Limitations

* **Accuracy**: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes.
* **Out-of-Distribution Data**: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the `B-K/thai-g2p` training dataset. This includes very rare words, neologisms, or complex named entities.
* **Ambiguity**: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities to the intended pronunciation in every context.
* **Sentence-Level vs. Word-Level**: While trained on a dataset that includes sentences, robustness on very long or highly complex sentences may vary; the average generated length observed during training was around 27 tokens. One possible mitigation is to transcribe long inputs in smaller chunks, as sketched after this list.
* **Inherited Limitations**: As a fine-tuned version of `google/umt5-small`, it inherits the general architectural limitations and scale of the base model.
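
A minimal sketch of that chunking idea, assuming whitespace (which Thai text typically places between phrases or clauses) is an acceptable split point; this heuristic is our assumption and has not been validated against this checkpoint.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

def g2p(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    ids = model.generate(**inputs, num_beams=3, max_new_tokens=48)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def g2p_long(text: str) -> str:
    # Naive heuristic: split at whitespace so each chunk stays close to the
    # sentence lengths seen during training, then rejoin the phoneme strings.
    chunks = [c for c in text.split() if c]
    return " ".join(g2p(c) for c in chunks)
```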

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ"  # Example Thai text
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")
```
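
The same conversion can also be run through the `text2text-generation` pipeline, which is convenient when processing several inputs; the generation settings below simply mirror the snippet above.

```python
from transformers import pipeline

g2p = pipeline("text2text-generation", model="B-K/umt5-thai-g2p")

for text in ["สวัสดีครับ", "ขอบคุณครับ"]:
    out = g2p(text, num_beams=3, max_new_tokens=48)
    print(text, "->", out[0]["generated_text"])
```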

## Training procedure

### Training Hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto `Seq2SeqTrainingArguments` follows this list):
* optimizer: adamw_torch
* learning_rate: 5e-4 in early runs, reduced to 5e-6 by the final runs
* lr_scheduler_type: cosine
* num_train_epochs: roughly 200 in total; the training settings were adjusted repeatedly across runs, so the exact count is approximate
* per_device_train_batch_size: 128
* per_device_eval_batch_size: 128
* weight_decay: 0.01 in early runs, increased to 0.1 by the final runs
* label_smoothing_factor: 0.1
* max_grad_norm: 1.0
* warmup_steps: 100
* mixed_precision: bf16
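
For reference, these settings map onto `Seq2SeqTrainingArguments` roughly as follows. This is a reconstruction, not the exact training script: the output directory, evaluation strategy, and `predict_with_generate` flag are assumptions, and the learning rate and weight decay shown are the starting values from the early runs.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",        # assumed
    optim="adamw_torch",
    learning_rate=5e-4,                # later runs went as low as 5e-6
    lr_scheduler_type="cosine",
    num_train_epochs=200,              # approximate, cumulative over runs
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,                 # later runs went up to 0.1
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    predict_with_generate=True,        # assumed; needed to compute CER at eval time
    eval_strategy="epoch",             # assumed
)
```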

### Training results

| Training Loss | Epoch | Step | Validation Loss | CER    | Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:------:|:-------:|
| No log        | 1.0   | 134  | 1.5636          | 0.0917 | 27.1747 |
| No log        | 2.0   | 268  | 1.5603          | 0.093  | 27.1781 |
| No log        | 3.0   | 402  | 1.5566          | 0.0938 | 27.1729 |
| 1.1631        | 4.0   | 536  | 1.5524          | 0.0941 | 27.1678 |
| 1.1631        | 5.0   | 670  | 1.5508          | 0.0939 | 27.113  |
| 1.1631        | 6.0   | 804  | 1.5472          | 0.0932 | 27.1575 |
| 1.1631        | 7.0   | 938  | 1.5450          | 0.0933 | 27.1421 |
| 1.1603        | 8.0   | 1072 | 1.5449          | 0.094  | 27.0616 |
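
The CER values above compare generated phoneme strings against the reference transcriptions. With the `evaluate` library (which needs `jiwer` installed for this metric), the computation looks like the sketch below; the strings are illustrative placeholders, not samples from the dataset.

```python
import evaluate

cer_metric = evaluate.load("cer")

# Illustrative placeholders; real evaluation uses the model's generated
# phonemes and the reference transcriptions from the validation split.
predictions = ["sa wat di khrap"]
references = ["sa wat dii khrap"]

cer = cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {cer:.3f}")
```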

### Framework versions

- Transformers 4.47.0
- Pytorch 2.5.1
- Datasets 3.6.0
- Tokenizers 0.21.0