luoyingfeng
/

BiMaTE-8B

Update README.md

#1

by Linyuana - opened Jul 8

base: refs/heads/main

←

from: refs/pr/1

README.md +61 -1

@@ -1,4 +1,64 @@
 ---
 base_model:
 - Qwen/Qwen3-8B-Base
----

+---
+# Model Card for BiMaTE-8B
+⚠️ This is a **temporary repository** for our [EMNLP 2025] demo paper submission.
+The project is currently hosted here for review and demonstration purposes.
+It will be migrated to the official organization repository once it becomes available.
+All code, models, and documentation are maintained here until then.
+GitHub: [LMT](https://github.com/NiuTrans/LMT)
+## Model Details
+### Model Description
+BiMaTE (Bi-Centric Machine Translation Expert) is a large-scale, LLM-based, Chinese-English-centric multilingual translation model designed for high-quality translation between Chinese, English, and numerous other languages.
+- **Model type:** Causal Language Model for Machine Translation
+- **Languages:** 60
+- **Translation directions:** 234
+- **Base Model:** Qwen3-8B-Base
+- **Training Strategy:**
+    1. Monolingual Continual Pretraining (CPT): 30B tokens
+    2. Mixed Continual Pretraining (CPT): 60B tokens (monolingual and bilingual data)
+    3. Supervised Finetuning (SFT): Post-training on smaller-scale, high-quality translation data.
+## Quickstart
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "luoyingfeng/BiMaTE-8B"
+# Left padding keeps prompts right-aligned for decoder-only generation.
+tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
+model = AutoModelForCausalLM.from_pretrained(model_name)
+
+prompt = "Translate the following text from English into Chinese.\nEnglish: The concept came from China where plum blossoms were the flower of choice.\nChinese: "
+messages = [{"role": "user", "content": prompt}]
+# Render the chat template as plain text and append the generation prompt.
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+# Deterministic beam search with 5 beams.
+generated_ids = model.generate(**model_inputs, max_new_tokens=512, num_beams=5, do_sample=False)
+# Drop the prompt tokens and keep only the newly generated ones.
+output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+# decode() returns one string; batch_decode() on a flat id list would decode token by token.
+outputs = tokenizer.decode(output_ids, skip_special_tokens=True)
+print("response:", outputs)
+```
+## Supported Languages
+| Resource Tier | Languages |
+| :---- | :---- |
+| High-resource Languages (13) | Arabic(ar), English(en), Spanish(es), German(de), French(fr), Italian(it), Japanese(ja), Dutch(nl), Polish(pl), Portuguese(pt), Russian(ru), Turkish(tr), Chinese(zh) |
+| Medium-resource Languages (18) | Bulgarian(bg), Bengali(bn), Czech(cs), Danish(da), Modern Greek(el), Persian(fa), Finnish(fi), Hindi(hi), Hungarian(hu), Indonesian(id), Korean(ko), Norwegian(no), Romanian(ro), Slovak(sk), Swedish(sv), Thai(th), Ukrainian(uk), Vietnamese(vi) |
+| Low-resource Languages (29) | Amharic(am), Azerbaijani(az), Tibetan(bo), Modern Hebrew(he), Croatian(hr), Armenian(hy), Icelandic(is), Javanese(jv), Georgian(ka), Kazakh(kk), Central Khmer(km), Kirghiz(ky), Lao(lo), Mongolian(mn), Marathi(mr), Malay(ms), Burmese(my), Nepali(ne), Pashto(ps), Sinhala(si), Swahili(sw), Tamil(ta), Telugu(te), Tajik(tg), Tagalog(tl), Uighur(ug), Urdu(ur), Uzbek(uz), Yue Chinese(yue) |
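
The README states 60 languages and 234 translation directions but does not say how the directions are counted. The sketch below is one way to check those numbers against the language table. It assumes that "Chinese-English-centric" means every direction has Chinese or English on at least one side. The helper names `centric_directions` and `build_prompt` are hypothetical, and the prompt template is generalized from the single English-to-Chinese Quickstart example, so other language pairs may need a different format.

```python
# Language codes copied from the language table in the README diff (60 in total).
HIGH = "ar en es de fr it ja nl pl pt ru tr zh".split()
MEDIUM = "bg bn cs da el fa fi hi hu id ko no ro sk sv th uk vi".split()
LOW = ("am az bo he hr hy is jv ka kk km ky lo mn mr ms my ne ps si sw "
       "ta te tg tl ug ur uz yue").split()
LANGS = HIGH + MEDIUM + LOW

# Assumption: zh or en must appear on at least one side of every direction.
PIVOTS = {"zh", "en"}


def centric_directions(langs, pivots):
    """Return ordered (source, target) pairs that involve at least one pivot."""
    return [(s, t) for s in langs for t in langs
            if s != t and (s in pivots or t in pivots)]


def build_prompt(src_name, tgt_name, text):
    """Format a request like the Quickstart prompt (template generalized by assumption)."""
    return (f"Translate the following text from {src_name} into {tgt_name}.\n"
            f"{src_name}: {text}\n{tgt_name}: ")


if __name__ == "__main__":
    print(len(LANGS))                              # number of language codes
    print(len(centric_directions(LANGS, PIVOTS)))  # number of directions
    print(build_prompt("German", "English", "Guten Morgen."))
```

Under that assumption the table yields 60 codes and 234 directions (58 other languages times 2 pivots times 2 directions, plus zh→en and en→zh), which agrees with the figures in the Model Description.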