---
library_name: transformers
license: apache-2.0
model-index:
- name: umt5-thai-g2p-9
  results:
  - task:
      type: text2text-generation
      name: Grapheme-to-Phoneme Conversion
    dataset:
      name: B-K/thai-g2p
      type: B-K/thai-g2p
      config: default
      split: sentence_validation
    metrics:
    - type: cer
      value: 0.094
      name: Character Error Rate
    - type: loss
      value: 1.5449
      name: Loss
datasets:
- B-K/thai-g2p
language:
- th
metrics:
- cer
pipeline_tag: text2text-generation
widget:
- text: สวัสดีครับ
  example_title: Thai G2P Example
new_version: B-K/umt5-thai-g2p-v2-0.5k
---

# umt5-thai-g2p

This model is a fine-tuned version of [google/umt5-small](https://huggingface.co/google/umt5-small) on the [B-K/thai-g2p](https://huggingface.co/datasets/B-K/thai-g2p) dataset for Thai Grapheme-to-Phoneme (G2P) conversion.

It achieves the following results on the sentence evaluation set:
- Loss: 1.5449
- CER: 0.094

## Model Description

`umt5-thai-g2p` is designed to convert Thai text (words or sentences) into their corresponding phonemic International Phonetic Alphabet (IPA) representations.

## Intended uses & limitations

### Intended Uses

*   **Thai Grapheme-to-Phoneme (G2P) Conversion**: The primary use of this model is to generate phonemic transcriptions (IPA) for Thai text.
*   **Speech Synthesis Preprocessing**: Can be used as a component in a Text-to-Speech (TTS) pipeline to convert input text into phonemes before acoustic model processing; a batch-conversion sketch follows this list.
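
As a rough illustration of the TTS-preprocessing use case, the hypothetical helper below converts a batch of Thai sentences into phoneme strings that a downstream acoustic model could consume. It is a sketch built on the usage shown in the "How to use" section below, not an API shipped with this repository.

```python
# Illustrative sketch only: thai_to_phonemes is a hypothetical helper,
# not part of this repository.
from typing import List

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

def thai_to_phonemes(sentences: List[str]) -> List[str]:
    """Convert a batch of Thai sentences to IPA phoneme strings."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# The resulting phoneme strings would then be passed to whichever acoustic
# model or TTS front end you use; that step is out of scope for this model.
print(thai_to_phonemes(["สวัสดีครับ", "ขอบคุณมาก"]))
```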

### Limitations

*   **Accuracy**: While the model achieves a Character Error Rate (CER) of approximately 0.094 on the evaluation set, it is not 100% accurate. Users should expect some errors in the generated phonemes.
*   **Out-of-Distribution Data**: Performance may degrade on words, phrases, or sentence structures significantly different from those present in the `B-K/thai-g2p` training dataset. This includes very rare words, neologisms, or complex named entities.
*   **Ambiguity**: Thai orthography can sometimes be ambiguous, and the model might not always resolve such ambiguities correctly to the intended pronunciation in all contexts.
*   **Sentence-Level vs. Word-Level**: While trained on a dataset that includes sentences, its robustness for very long or highly complex sentences might vary. The average generated length observed during training was around 27 tokens.
*   **Inherited Limitations**: As a fine-tuned version of `google/umt5-small`, it inherits the general architectural limitations and scale of the base model.

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

thai_text = "สวัสดีครับ" # Example Thai text
inputs = tokenizer(thai_text, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(**inputs, num_beams=3, max_new_tokens=48)
phonemes = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Thai Text: {thai_text}")
print(f"Phonemes: {phonemes}")
```

## Training procedure

### Training Hyperparameters

The following hyperparameters were used during training (an illustrative configuration sketch follows the list):
* optimizer: adamw_torch
* learning_rate: 5e-4, decayed to 5e-6 by the end of training
* lr_scheduler_type: cosine
* num_train_epochs: approximately 200 in total (the training settings were retuned across several runs)
* per_device_train_batch_size: 128
* per_device_eval_batch_size: 128
* weight_decay: 0.01 initially, increased to 0.1 in later runs
* label_smoothing_factor: 0.1
* max_grad_norm: 1.0
* warmup_steps: 100
* mixed_precision: bf16
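
For reference, these settings map onto a `Seq2SeqTrainingArguments` configuration roughly as sketched below. This is illustrative only: the exact training script is not included in this card, and the learning rate, weight decay, and epoch count were adjusted across runs.

```python
# Illustrative configuration sketch; not the exact script used for training.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="umt5-thai-g2p",          # hypothetical output directory
    optim="adamw_torch",
    learning_rate=5e-4,                  # decayed toward 5e-6 over the run
    lr_scheduler_type="cosine",
    num_train_epochs=200,                # approximate; settings were retuned
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    weight_decay=0.01,                   # raised to 0.1 in later runs
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
    warmup_steps=100,
    bf16=True,
    eval_strategy="epoch",               # per-epoch eval, matching the results table
    predict_with_generate=True,          # needed to compute CER and Gen Len
)
```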

### Training results

| Training Loss | Epoch | Step | Validation Loss | CER    | Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:------:|:-------:|
| No log        | 1.0   | 134  | 1.5636          | 0.0917 | 27.1747 |
| No log        | 2.0   | 268  | 1.5603          | 0.093  | 27.1781 |
| No log        | 3.0   | 402  | 1.5566          | 0.0938 | 27.1729 |
| 1.1631        | 4.0   | 536  | 1.5524          | 0.0941 | 27.1678 |
| 1.1631        | 5.0   | 670  | 1.5508          | 0.0939 | 27.113  |
| 1.1631        | 6.0   | 804  | 1.5472          | 0.0932 | 27.1575 |
| 1.1631        | 7.0   | 938  | 1.5450          | 0.0933 | 27.1421 |
| 1.1603        | 8.0   | 1072 | 1.5449          | 0.094  | 27.0616 |
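
The validation CER above can be reproduced approximately with the `evaluate` library. The sketch below is illustrative rather than the exact evaluation script; in particular, the dataset column names (`text`, `phoneme`) are assumptions and should be checked against the `B-K/thai-g2p` dataset.

```python
# Illustrative CER evaluation sketch; the column names "text" and "phoneme"
# are assumptions -- check them against the B-K/thai-g2p dataset.
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("B-K/umt5-thai-g2p")
model = AutoModelForSeq2SeqLM.from_pretrained("B-K/umt5-thai-g2p")

dataset = load_dataset("B-K/thai-g2p", split="sentence_validation")
cer_metric = evaluate.load("cer")

predictions, references = [], []
for example in dataset.select(range(64)):      # small sample for illustration
    inputs = tokenizer(example["text"], return_tensors="pt")
    output = model.generate(**inputs, num_beams=3, max_new_tokens=48)
    predictions.append(tokenizer.decode(output[0], skip_special_tokens=True))
    references.append(example["phoneme"])

print("CER:", cer_metric.compute(predictions=predictions, references=references))
```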


### Framework versions

- Transformers 4.47.0
- Pytorch 2.5.1
- Datasets 3.6.0
- Tokenizers 0.21.0