---
base_model:
- Qwen/Qwen3-8B-Base
---
# BiMaTE-8B

⚠️ This is a **temporary repository** for our EMNLP 2025 demo paper submission. The project is hosted here for review and demonstration purposes and will be migrated to the official organization repository once it becomes available. Until then, all code, models, and documentation are maintained here.

GitHub: [LMT](https://github.com/NiuTrans/LMT)
## Model Details

### Model Description

BiMaTE (Bi-Centric Machine Translation Expert) is a large-scale, LLM-based, Chinese- and English-centric multilingual translation model designed to deliver high-quality translation between Chinese, English, and numerous other languages.

- **Model type:** Causal Language Model for Machine Translation
- **Languages:** 60
- **Translation directions:** 234
- **Base model:** Qwen3-8B-Base
- **Training strategy:**
  1. Monolingual Continual Pretraining (CPT): 30B tokens
  2. Mixed Continual Pretraining (CPT): 60B tokens (monolingual + bilingual)
  3. Supervised Finetuning (SFT): post-training on smaller-scale, high-quality translation data
## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "luoyingfeng/BiMaTE-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Translate the following text from English into Chinese.\nEnglish: The concept came from China where plum blossoms were the flower of choice.\nChinese: "
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512, num_beams=5, do_sample=False)
# Keep only the newly generated tokens, dropping the prompt
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

response = tokenizer.decode(output_ids, skip_special_tokens=True)

print("response:", response)
```
## Supported Languages

| Resource Tier | Languages |
| :---- | :---- |
| High-resource Languages (13) | Arabic (ar), English (en), Spanish (es), German (de), French (fr), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Turkish (tr), Chinese (zh) |
| Medium-resource Languages (18) | Bulgarian (bg), Bengali (bn), Czech (cs), Danish (da), Modern Greek (el), Persian (fa), Finnish (fi), Hindi (hi), Hungarian (hu), Indonesian (id), Korean (ko), Norwegian (no), Romanian (ro), Slovak (sk), Swedish (sv), Thai (th), Ukrainian (uk), Vietnamese (vi) |
| Low-resource Languages (29) | Amharic (am), Azerbaijani (az), Tibetan (bo), Modern Hebrew (he), Croatian (hr), Armenian (hy), Icelandic (is), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kirghiz (ky), Lao (lo), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Pashto (ps), Sinhala (si), Swahili (sw), Tamil (ta), Telugu (te), Tajik (tg), Tagalog (tl), Uighur (ug), Urdu (ur), Uzbek (uz), Yue Chinese (yue) |
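The Quickstart prompt follows a fixed template ("Translate the following text from X into Y."), so prompts for any of the other supported directions can be built from the language names in the table above. The helper below is a minimal sketch of this idea; the `LANG_NAMES` mapping and `build_prompt` function are illustrative assumptions, not part of the released code.

```python
# Hypothetical helper: builds a translation prompt in the same format
# as the Quickstart example, for any supported direction.

LANG_NAMES = {
    "ar": "Arabic", "en": "English", "es": "Spanish", "de": "German",
    "fr": "French", "it": "Italian", "ja": "Japanese", "nl": "Dutch",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "tr": "Turkish",
    "zh": "Chinese",
    # extend with the medium- and low-resource codes from the table above
}

def build_prompt(src: str, tgt: str, text: str) -> str:
    """Construct a BiMaTE-style translation prompt for one direction."""
    src_name, tgt_name = LANG_NAMES[src], LANG_NAMES[tgt]
    return (
        f"Translate the following text from {src_name} into {tgt_name}.\n"
        f"{src_name}: {text}\n"
        f"{tgt_name}: "
    )

print(build_prompt("en", "zh", "Plum blossoms were the flower of choice."))
```

The resulting string is passed as the `content` of a single user message, exactly as in the Quickstart.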