---
tags:
  - translation
  - nmt
  - cypriot-greek
  - greek
library_name: transformers
language:
  - el
license: cc-by-4.0
---

## Model Details

- **Developed by**: Nikolas Ziartis  
- **Institute**: University of Cyprus  
- **Model type**: MarianMT (Transformer-based Seq2Seq)  
- **Source language**: Cypriot Greek (no dedicated ISO 639-1 code; covered by Greek, `el`)  
- **Target language**: Modern Standard Greek (ISO 639-1: el)  
- **Fine-tuned from**: `Helsinki-NLP/opus-mt-en-grk`  
- **License**: CC BY 4.0  

## Model Description

This model is a MarianMT transformer fine-tuned with active learning to translate the low-resource Cypriot Greek dialect into Modern Standard Greek. Across nine iterative batches, we:

1. **Extracted high-dimensional embeddings** for every unlabeled Cypriot Greek sentence using the Greek LLM `ilsp/Meltemi-7B-Instruct-v1.5`.  
2. **Applied k-means clustering** to these embeddings to select the 50 most informative source sentences per batch (sketched below).  
3. **Had human annotators** translate those 50 sentences into Standard Greek.  
4. **Fine-tuned** the MarianMT model on the accumulating parallel corpus, freezing and unfreezing layers to preserve learned representations (see the freezing sketch below).  
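
Steps 1–2 can be sketched in a few lines of Python. This is an illustration, not the authors' exact pipeline: mean-pooling the LLM's last hidden states into sentence embeddings, and treating the sentence nearest each k-means centroid as "most informative", are both assumptions, since the card specifies neither the pooling strategy nor the exact selection criterion.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

# Embed unlabeled Cypriot Greek sentences with the Greek LLM.
llm_name = "ilsp/Meltemi-7B-Instruct-v1.5"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
if tokenizer.pad_token is None:  # decoder-only models often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
llm = AutoModel.from_pretrained(llm_name, torch_dtype=torch.float16)
llm.eval()

def embed(sentences: list[str]) -> np.ndarray:
    batch = tokenizer(sentences, return_tensors="pt",
                      padding=True, truncation=True)
    with torch.no_grad():
        hidden = llm(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # zero out padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
    return pooled.float().cpu().numpy()

unlabeled = [...]  # placeholder: pool of untranslated sentences (>= 50 needed)
X = embed(unlabeled)

# One cluster per annotation slot; the sentence nearest each centroid
# becomes part of the 50-sentence batch sent to human annotators.
km = KMeans(n_clusters=50, random_state=0).fit(X)
chosen = {int(np.linalg.norm(X - c, axis=1).argmin()) for c in km.cluster_centers_}
batch_to_annotate = [unlabeled[i] for i in sorted(chosen)]
```

Picking centroid-nearest sentences is a standard diversity-based active-learning heuristic: it spreads a fixed annotation budget across distinct regions of the embedding space rather than spending it on near-duplicates.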

The result is a system that captures colloquial Cypriot expressions accurately while producing fluent Modern Standard Greek.
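
The freeze/unfreeze idea from step 4 might look roughly as follows. The exact schedule (which layers, and when) is an assumption; the card only states that layers were frozen and unfrozen to preserve learned representations.

```python
from transformers import MarianMTModel

# Base checkpoint named in the model card.
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-grk")

# Phase 1: freeze the encoder so early-batch updates only adjust the
# decoder, preserving the pretrained source-side representations.
for param in model.get_encoder().parameters():
    param.requires_grad = False

# ... fine-tune on the accumulated parallel corpus (e.g. Seq2SeqTrainer) ...

# Phase 2: unfreeze everything for a full pass on the now-larger corpus.
for param in model.get_encoder().parameters():
    param.requires_grad = True
```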

## Usage

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "ZiartisNikolas/NMT-cypriot-dialect-to-greek"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src = ["Τζ̆αι φυσικά ήξερα ίνταμπου εγινίσκετουν."]  # Cypriot Greek, roughly: "And of course I knew what was happening."
batch = tokenizer(src, return_tensors="pt", padding=True)
gen = model.generate(**batch)
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
```