---
tags:
- translation
- nmt
- cypriot-greek
- greek
library_name: transformers
language:
- el
license: cc-by-4.0
---
## Model Details
- **Developed by**: Nikolas Ziartis
- **Institute**: University of Cyprus
- **Model type**: MarianMT (Transformer-based Seq2Seq)
- **Source language**: Cypriot Greek (a dialect of Greek with no dedicated ISO 639-1 code; covered by `el`)
- **Target language**: Modern Standard Greek (ISO 639-1: el)
- **Fine-tuned from**: `Helsinki-NLP/opus-mt-en-grk`
- **License**: CC BY 4.0
## Model Description
This model is a MarianMT transformer, fine-tuned via active learning to translate from the low-resource Cypriot Greek dialect into Modern Standard Greek. In nine iterative batches, we:
1. **Extracted high-dimensional embeddings** for every unlabeled Cypriot Greek sentence using the Greek LLM `ilsp/Meltemi-7B-Instruct-v1.5`.
2. **Applied k-means clustering** to the embeddings to select the 50 most informative unlabeled sentences per batch.
3. **Had human annotators** translate those 50 sentences into Standard Greek.
4. **Fine-tuned** the MarianMT model on the accumulating parallel corpus, freezing and unfreezing layers to preserve learned representations.
The result is a system that accurately captures colloquial Cypriot expressions while producing fluent Modern Greek.
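The selection in steps 1–2 can be sketched as follows. This is an illustrative, standard-library-only stand-in: the `kmeans_select` helper is hypothetical, and the random vectors stand in for the `ilsp/Meltemi-7B-Instruct-v1.5` embeddings used in the real pipeline.

```python
import random

def kmeans_select(vectors, k, iters=20, seed=0):
    """Run a minimal k-means, then return the index of the vector
    closest to each centroid (one 'representative' per cluster)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    dim = len(vectors[0])

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iters):
        # Assignment step: group each vector with its nearest centroid
        groups = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda j: dist2(v, centroids[j]))
            groups[j].append(v)
        # Update step: move each centroid to its group's mean
        for j, g in enumerate(groups):
            if g:
                centroids[j] = [sum(x[d] for x in g) / len(g) for d in range(dim)]

    # Pick the sentence nearest each centroid for human annotation
    picks = {min(range(len(vectors)), key=lambda i: dist2(vectors[i], c))
             for c in centroids}
    return sorted(picks)

# Toy demo: 200 random 8-dim "embeddings", select up to 10 sentences
rng = random.Random(1)
vecs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
idx = kmeans_select(vecs, k=10)
print(len(idx))
```

In the actual setup, `k` would be 50 per batch and the selected indices would be sent to the annotators (step 3).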
## Usage
```python
from transformers import MarianMTModel, MarianTokenizer
model_name = "ZiartisNikolas/NMT-cypriot-dialect-to-greek"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
src = ["Τζ̆αι φυσικά ήξερα ίνταμπου εγινίσκετουν."] # Cypriot Greek sentence
batch = tokenizer(src, return_tensors="pt", padding=True)
gen = model.generate(**batch)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
```
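
For further fine-tuning on new parallel data, the encoder-freezing pattern from step 4 might look like the sketch below. The tiny randomly initialised `MarianConfig` model here is only a stand-in so the snippet runs without downloading the checkpoint; in practice you would load the fine-tuned model as in the usage example above.

```python
from transformers import MarianConfig, MarianMTModel

# Tiny stand-in model (illustrative only; use the real checkpoint in practice)
config = MarianConfig(
    vocab_size=128, d_model=16,
    encoder_layers=1, decoder_layers=1,
    encoder_attention_heads=2, decoder_attention_heads=2,
    encoder_ffn_dim=32, decoder_ffn_dim=32,
    max_position_embeddings=64,
    pad_token_id=0, eos_token_id=1, decoder_start_token_id=0,
)
model = MarianMTModel(config)

# Freeze the encoder to preserve learned source-side representations
# while the decoder adapts to the new batch of translations
for p in model.model.encoder.parameters():
    p.requires_grad = False

enc_trainable = sum(p.numel() for p in model.model.encoder.parameters()
                    if p.requires_grad)
print(enc_trainable)  # 0
```

Unfreezing later batches is the same loop with `p.requires_grad = True`.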