QomSSLab/Anonymizer-4b

QomSSLab/Anonymizer-4b is a fine-tuned Gemma 3 4B model that anonymizes Persian legal texts by masking or replacing personally identifiable information (PII). It was trained on the QomSSLab/Anonymized_Cases dataset.

💡 Use Cases

  • Data privacy for legal document processing.
  • Preprocessing step for building publicly shareable Persian legal corpora.
  • Protecting PII in judicial NLP pipelines.

🧠 Model Details

  • Base Model: Gemma 3 4B
  • Language: Persian (Farsi)
  • Training Data: Synthetic and real anonymized Persian legal cases.
  • Task: Text-to-text generation (anonymization)

📦 Example Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

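# Load in bfloat16 (the dtype of the published weights); device_map="auto" places the model on the available device(s).
model = AutoModelForCausalLM.from_pretrained("QomSSLab/Anonymizer-4b", torch_dtype=torch.bfloat16, device_map="auto")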
tokenizer = AutoTokenizer.from_pretrained("QomSSLab/Anonymizer-4b")
tokenizer.add_eos_token = False

messages = [
    {"role": "system", "content": "You are a data privacy expert. Your task is to anonymize the following case text by removing or replacing all personally identifiable information (PII)."},
    # User text (Persian): "A case about the marriage between Haniyeh and Abdolrahim, with multiple identity details..."
    {"role": "user", "content": "پرونده‌ای درباره ازدواج بین هانیه و عبدالرحیم با اطلاعات هویتی متعدد..."}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, add_special_tokens=False)
# Move inputs to wherever device_map="auto" placed the model instead of hard-coding "cuda".
inputs = tokenizer([prompt], return_tensors="pt", add_special_tokens=False).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    do_sample=True,  # sampling must be on for temperature/top_p/top_k to take effect
    temperature=0.1,
    top_p=0.95,
    top_k=64,
    disable_compile=True
)

anonymized_text = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(anonymized_text)
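
The call above caps the output at 400 new tokens, so a long case file may come back truncated. One possible workaround, sketched here, is to split the document into paragraph-based chunks and anonymize each chunk separately. The helper names (`split_into_chunks`, `anonymize_document`), the character budget, and the `anonymize_fn` callable (assumed to wrap the generation code above) are illustrative assumptions, not part of the released model.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 1000) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so a chunk can exceed the budget in that case.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def anonymize_document(text: str, anonymize_fn: Callable[[str], str],
                       max_chars: int = 1000) -> str:
    """Anonymize each chunk with anonymize_fn (hypothetical) and join the results."""
    return "\n".join(anonymize_fn(chunk) for chunk in split_into_chunks(text, max_chars))
```

Because each chunk is processed independently, the same person may receive different replacements in different chunks; whether that matters depends on the downstream use.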

📊 Evaluation

The model was evaluated qualitatively on a diverse collection of Persian legal documents; no quantitative metrics are reported. In these checks it identified and anonymized a range of personally identifiable information (PII), including:

  • Full names
  • National IDs
  • Addresses
  • Dates of birth
  • Case numbers
  • Geographic locations

The model is particularly well-suited for preprocessing court cases for research, public data release, or downstream tasks like summarization and classification while preserving privacy.
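
One way to make a qualitative check like this repeatable is to keep, for each test document, a list of PII strings known to occur in it and confirm that none survive in the output. The sketch below assumes such a list exists; the function name `find_leaked_entities` and the sample strings are hypothetical, and this is not the evaluation procedure used by the authors.

```python
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Drop zero-width non-joiners and collapse whitespace before comparing."""
    return " ".join(text.replace("\u200c", "").split())


def find_leaked_entities(known_pii: Iterable[str], anonymized_text: str) -> List[str]:
    """Return the known PII strings that still occur in the anonymized text."""
    haystack = _normalize(anonymized_text)
    leaked = []
    for entity in known_pii:
        needle = _normalize(entity)
        if needle and needle in haystack:
            leaked.append(entity)
    return leaked


# Hypothetical output in which one of the two names was not replaced.
# The "[NAME]" placeholder is invented for this example.
sample_output = "ازدواج بین [NAME] و عبدالرحیم"
print(find_leaked_entities(["هانیه", "عبدالرحیم"], sample_output))
```

An empty result only shows that the listed strings are gone; it says nothing about PII that was never listed.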

Limitations

  • May occasionally miss rare or out-of-distribution PII formats.
  • Not guaranteed to anonymize very short or extremely noisy texts.
  • Trained primarily on formal legal language; performance may degrade on informal Persian.
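
Since some PII may slip through, a rule-based pass over the model output can serve as a safety net. The sketch below flags any standalone 10-digit run (the length of an Iranian national ID) after mapping Persian and Arabic-Indic digits to ASCII. The function name `find_residual_ids` and the choice of pattern are assumptions; a real pipeline would likely need further patterns (phone numbers, postal codes) tuned to its own data.

```python
import re
from typing import List

# Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to ASCII.
_DIGIT_MAP = {0x06F0 + i: str(i) for i in range(10)}
_DIGIT_MAP.update({0x0660 + i: str(i) for i in range(10)})

# Exactly ten digits that are not part of a longer digit run.
_TEN_DIGITS = re.compile(r"(?<!\d)\d{10}(?!\d)")


def find_residual_ids(text: str) -> List[str]:
    """Return 10-digit runs left in text, reported in ASCII digits."""
    ascii_text = text.translate(_DIGIT_MAP)
    return _TEN_DIGITS.findall(ascii_text)
```

Any hit is a candidate for manual review rather than proof of a leak, since case numbers or other identifiers can also be ten digits long.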

Dataset

This model was fine-tuned on the QomSSLab/Anonymized_Cases dataset, which includes manually and synthetically anonymized court documents and legal filings in Persian. The dataset contains a mix of real and simulated entities, helping the model generalize across varied legal formats and writing styles.

Model size: 3.88B params (Safetensors, tensor type BF16)