---
tags:
- translation
- nmt
- cypriot-greek
- greek
library_name: transformers
languages:
- cy
- el
license: cc-by-4.0
---

## Model Details

- **Developed by**: Nikolas Ziartis
- **Institute**: University of Cyprus
- **Model type**: MarianMT (Transformer-based Seq2Seq)
- **Source language**: Cypriot Greek (tagged `cy` in this card's metadata; Cypriot Greek has no ISO 639-1 code, and `cy` is the ISO 639-1 code for Welsh)
- **Target language**: Modern Standard Greek (ISO 639-1: el)
- **Fine-tuned from**: `Helsinki-NLP/opus-mt-en-grk`
- **License**: CC BY 4.0

## Model Description

This model is a MarianMT transformer, fine-tuned via active learning to translate from the low-resource Cypriot Greek dialect into Modern Standard Greek. In each of nine iterative batches, we:

1. **Extracted high-dimensional embeddings** for every unlabeled Cypriot sentence using the Greek LLM `ilsp/Meltemi-7B-Instruct-v1.5`.
2. **Applied k-means clustering** to those embeddings to select the 50 “most informative” sentences per batch.
3. **Had human annotators** translate those 50 sentences into Standard Greek.
4. **Fine-tuned** the MarianMT model on the accumulating parallel corpus, freezing and unfreezing layers to preserve learned representations.

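The selection in step 2 can be illustrated with a minimal, dependency-free sketch. It assumes that "most informative" means the sentence closest to each k-means centroid, with one sentence picked per cluster; the card does not spell out the exact criterion, so treat this as one plausible reading rather than the procedure actually used. The function names (`kmeans`, `select_representatives`) are hypothetical, and in practice a library such as scikit-learn would likely replace the hand-written clustering.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids. Requires k <= len(points)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def select_representatives(embeddings, k, seed=0):
    """Return sorted indices of k distinct sentences, one nearest to each centroid.

    Assumption: 'informative' is approximated by closeness to a cluster centroid.
    """
    centroids = kmeans(embeddings, k, seed=seed)
    chosen = set()
    for centroid in centroids:
        candidates = (i for i in range(len(embeddings)) if i not in chosen)
        best = min(candidates, key=lambda i: squared_distance(embeddings[i], centroid))
        chosen.add(best)
    return sorted(chosen)
```

With the sentence embeddings from step 1 held as a list of float vectors, `select_representatives(embeddings, 50)` would return the indices of the 50 sentences to send to annotators in step 3.
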
The result is a system that captures colloquial Cypriot expressions while producing fluent Modern Standard Greek.

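The layer freezing mentioned in step 4 could be done along the following lines. This is a hedged sketch, not the training recipe used for this model: the card does not say which layers were frozen or when. The helper names are hypothetical, and the `model.encoder.` prefix is an assumption about how `MarianMTModel` names its parameters. The helpers rely only on the PyTorch `named_parameters()` method and the `requires_grad` attribute.

```python
def set_trainable(model, prefixes, trainable):
    """Set requires_grad on every parameter whose name starts with one of the prefixes.

    Returns the names of the parameters that were touched.
    """
    prefixes = tuple(prefixes)
    changed = []
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            param.requires_grad = trainable
            changed.append(name)
    return changed


def trainable_names(model):
    """List the names of parameters that will still receive gradient updates."""
    return [name for name, param in model.named_parameters() if param.requires_grad]


# Illustrative schedule (assumed, not taken from the card):
#   set_trainable(model, ["model.encoder."], trainable=False)  # early batches: freeze the encoder
#   set_trainable(model, ["model.encoder."], trainable=True)   # later batches: unfreeze it again
```

Checking `trainable_names(model)` before each fine-tuning round is a cheap way to confirm that the intended layers are frozen.
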
## Usage

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model_name = "ZiartisNikolas/NMT-cypriot-dialect-to-greek"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src = ["Τζ̆αι φυσικά ήξερα ίνταμπου εγινίσκετουν."]  # Cypriot Greek sentence
batch = tokenizer(src, return_tensors="pt", padding=True)  # tokenize into padded PyTorch tensors
gen = model.generate(**batch)  # generate Standard Greek token ids
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
```