metadata
license: apache-2.0
datasets:
  - ai4bharat/samanantar
language:
  - en
  - hi
base_model:
  - Helsinki-NLP/opus-mt-en-hi

Pankaj8922/better-opus-mt-en-hi

Fine-tuned MarianMT model for English → Hindi translation. The model is fine-tuned on AI4Bharat's Samanantar dataset, which contains over 10 million high-quality parallel sentences.

🔍 Model Details

  • Base model: Helsinki-NLP/opus-mt-en-hi
  • Fine-tuned on: ai4bharat/samanantar English–Hindi subset
  • Total params: ~77M (MarianMT)
  • Framework: Hugging Face Transformers
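
The card does not include usage code, so the following is a minimal inference sketch. It assumes the checkpoint loads through the standard `MarianMTModel` and `MarianTokenizer` classes in Transformers (with `sentencepiece` and PyTorch installed). The helper names `chunked` and `translate`, the batch size, and the `max_new_tokens` limit are illustrative choices, not settings documented by the author.

```python
from typing import Iterator, List

MODEL_ID = "Pankaj8922/better-opus-mt-en-hi"


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate(sentences: List[str], batch_size: int = 8) -> List[str]:
    """Translate English sentences to Hindi, one batch at a time."""
    # Imported here so the batching helper works without loading Transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(MODEL_ID)
    model = MarianMTModel.from_pretrained(MODEL_ID)
    model.eval()

    outputs: List[str] = []
    for batch in chunked(sentences, batch_size):
        # Pad to the longest sentence in the batch; truncate over-long inputs.
        encoded = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**encoded, max_new_tokens=128)  # assumed limit
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs
```

Calling `translate(["The clinic opens at nine in the morning."])` should return a list with one Hindi string. The model and tokenizer are reloaded on every call here for brevity; a long-running service would likely load them once and reuse them.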

📊 Performance (BLEU / chrF on 500 samples from Namratap/En-Hindi)

| Domain | Base BLEU | Fine-tuned BLEU | Base chrF | Fine-tuned chrF |
|---|---|---|---|---|
| Healthcare | 15.54 | 27.95 | 38.06 | 54.09 |
| Gen News | 14.11 | 26.31 | 39.07 | 52.98 |
| Culture/Tourism | 12.76 | 18.49 | 35.07 | 41.32 |
| Education | 20.28 | 28.82 | 43.84 | 49.68 |

✅ BLEU improvements of +5.7 to +12.4 points across domains (smallest for Culture/Tourism, largest for Healthcare)
✅ chrF boosts of +5.8 to +16.0 points, indicating closer character-level overlap with the reference translations
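
The per-domain gains can be recomputed directly from the reported scores. The sketch below only does that arithmetic on the figures in the table; the names `SCORES` and `gains` are illustrative, and it does not rerun the evaluation.

```python
# Scores copied from the table:
# (base BLEU, fine-tuned BLEU, base chrF, fine-tuned chrF)
SCORES = {
    "Healthcare": (15.54, 27.95, 38.06, 54.09),
    "Gen News": (14.11, 26.31, 39.07, 52.98),
    "Culture/Tourism": (12.76, 18.49, 35.07, 41.32),
    "Education": (20.28, 28.82, 43.84, 49.68),
}


def gains(scores):
    """Return {domain: (BLEU gain, chrF gain)}, rounded to two decimals."""
    return {
        domain: (round(ft_bleu - base_bleu, 2), round(ft_chrf - base_chrf, 2))
        for domain, (base_bleu, ft_bleu, base_chrf, ft_chrf) in scores.items()
    }


for domain, (bleu_gain, chrf_gain) in gains(SCORES).items():
    print(f"{domain:<16} BLEU +{bleu_gain:.2f}  chrF +{chrf_gain:.2f}")
```

On these figures the BLEU gains are +12.41 (Healthcare), +12.20 (Gen News), +5.73 (Culture/Tourism), and +8.54 (Education); the matching chrF gains are +16.03, +13.91, +6.25, and +5.84.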

🧠 Use Cases

  • Book and news translation into Hindi
  • Offline/secure translation pipelines
  • Domain-adapted fine-tuning

📁 Files Included

  • pytorch_model.bin: fine-tuned model weights
  • config.json: model architecture
  • tokenizer_config.json, vocab.json, source.spm, target.spm: tokenizer
  • generation_config.json: default decoding setup
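
For offline pipelines it can help to confirm that a local copy of the repository is complete before loading it. The sketch below checks a directory against the file names listed above; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced for illustration, and the check looks only at file presence, not at file contents.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
EXPECTED_FILES = [
    "pytorch_model.bin",
    "config.json",
    "tokenizer_config.json",
    "vocab.json",
    "source.spm",
    "target.spm",
    "generation_config.json",
]


def missing_files(model_dir: str) -> List[str]:
    """Return the expected file names that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result suggests the directory can be passed to `from_pretrained` with `local_files_only=True`, which keeps Transformers from contacting the Hub.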

⚖️ License

Apache 2.0 (same as the original model and the Samanantar dataset)