---
tags:
- translation
- nmt
- cypriot-greek
- greek
library_name: transformers
languages:
- cy
- el
license: cc-by-4.0
---

## Model Details

- Developed by: Nikolas Ziartis
- Institute: University of Cyprus
- Model type: MarianMT (Transformer-based Seq2Seq)
- Source language: Cypriot Greek (tagged `cy` in this card's metadata; in ISO 639-1, `cy` denotes Welsh, and Cypriot Greek has no ISO 639-1 code of its own)
- Target language: Modern Standard Greek (ISO 639-1: el)
- Fine-tuned from: `Helsinki-NLP/opus-mt-en-grk`
- License: CC BY 4.0

## Model Description

This model is a MarianMT transformer, fine-tuned via active learning to translate from the low-resource Cypriot Greek dialect into Modern Standard Greek. Over nine iterative batches, we:

1. Extracted high-dimensional embeddings for every unlabeled Cypriot sentence using the Greek LLM `ilsp/Meltemi-7B-Instruct-v1.5`.
2. Applied k-means clustering to those embeddings to select the 50 “most informative” sentences per batch.
3. Had human annotators translate those 50 sentences into Standard Greek.
4. Fine-tuned the MarianMT model on the accumulating parallel corpus, freezing and unfreezing layers to preserve learned representations.
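
The clustering-based selection in steps 1 and 2 can be illustrated with a small, dependency-free sketch. It assumes that the "most informative" sentences are the ones closest to each k-means centroid, which is one common heuristic; the card does not state which rule was actually used. The helper names (`kmeans`, `select_representatives`) and the toy 2-D vectors are hypothetical stand-ins for the real LLM embeddings.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def select_representatives(embeddings, k, seed=0):
    """Return k distinct indices: the sentence nearest to each centroid.

    Assumption: "closest to the centroid" stands in for "most informative".
    """
    chosen = []
    for centroid in kmeans(embeddings, k, seed=seed):
        order = sorted(range(len(embeddings)),
                       key=lambda i: sq_dist(embeddings[i], centroid))
        # Take the nearest sentence not already chosen, so indices stay unique.
        chosen.append(next(i for i in order if i not in chosen))
    return chosen


# Toy 2-D vectors standing in for high-dimensional sentence embeddings.
toy_embeddings = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
    [9.0, 0.1], [9.1, 0.0], [8.9, 0.2],
]
picked = select_representatives(toy_embeddings, k=3)
print(picked)  # indices of the sentences to send to annotators
```

In the setting described above, `k` would be 50 and the vectors would come from the embedding model; for real workloads a library implementation such as scikit-learn's `KMeans` would be the more practical choice.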

The result is a system that captures colloquial Cypriot expressions while producing fluent Modern Standard Greek.
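
The overall procedure is a pool-based active-learning loop, and its bookkeeping can be sketched as follows. This is only a skeleton under stated assumptions: `select_batch`, `annotate`, and `fine_tune` are hypothetical placeholders (random sampling stands in for the k-means selection, and no model is trained), and the pool size of 1000 is invented for illustration. If every round contributes the full 50 sentences, nine rounds would yield 9 × 50 = 450 sentence pairs; the card itself does not state the final corpus size.

```python
import random

NUM_ROUNDS = 9   # "nine iterative batches" in the description
BATCH_SIZE = 50  # sentences sent to annotators per round


def select_batch(pool, size, rng):
    # Placeholder: the card selects via k-means over LLM embeddings.
    return rng.sample(pool, size)


def annotate(sentences):
    # Placeholder for human translation into Standard Greek.
    return [(s, f"<translation of {s}>") for s in sentences]


def fine_tune(parallel_corpus):
    # Placeholder for a MarianMT fine-tuning run on all pairs so far;
    # here it only reports how many pairs the run would see.
    return len(parallel_corpus)


def active_learning(pool, rounds=NUM_ROUNDS, batch_size=BATCH_SIZE, seed=0):
    rng = random.Random(seed)
    pool = list(pool)
    corpus, history = [], []
    for _ in range(rounds):
        batch = select_batch(pool, batch_size, rng)
        picked = set(batch)
        pool = [s for s in pool if s not in picked]  # remove labeled items
        corpus.extend(annotate(batch))               # corpus accumulates
        history.append(fine_tune(corpus))            # retrain on all pairs
    return corpus, pool, history


# Hypothetical pool of 1000 unlabeled sentences.
unlabeled = [f"sentence-{i}" for i in range(1000)]
corpus, remaining, history = active_learning(unlabeled)
print(len(corpus), len(remaining))  # 450 550
print(history)  # [50, 100, 150, 200, 250, 300, 350, 400, 450]
```

Because each round retrains on everything labeled so far, the training set grows by one batch per round while the unlabeled pool shrinks by the same amount.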

## Usage

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
model_name = "ZiartisNikolas/NMT-cypriot-dialect-to-greek"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src = ["Τζ̆αι φυσικά ήξερα ίνταμπου εγινίσκετουν."]  # Cypriot Greek sentence
# Tokenize as a padded batch of PyTorch tensors, then generate translations.
batch = tokenizer(src, return_tensors="pt", padding=True)
gen = model.generate(**batch)
# Decode the generated token IDs back into Standard Greek text.
print(tokenizer.batch_decode(gen, skip_special_tokens=True))
```