Model Card for `impresso-project/ocr-quality-assessor-unigram-light`

Overview

This model is a lightweight OCR quality assessor for historical French and German texts. It is a streamlined version of the original impresso-project/OCR-quality-assessment-unigram, now accessible via a Hugging Face pipeline for convenient integration into downstream tasks.

It uses Bloom filters containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the Impresso Project, which develops tools for media archive processing and exploration.
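To make the scoring idea concrete, here is a minimal sketch of a known-word ratio computed against a Bloom filter. The `TinyBloomFilter` class, its sizes, the SHA-256 based hashing, and the `known_word_ratio` helper are all illustrative assumptions; the filters and hashing scheme shipped with the model may differ, and the sketch counts every token occurrence, which may not match the model's exact counting rule.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; not the data structure shipped with the model."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        # True if every derived bit is set (may be a false positive).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def known_word_ratio(tokens, lexicon) -> float:
    """Share of tokens found in the lexicon; 0.0 for empty input."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


# Toy lexicon with four known unigrams.
lexicon = TinyBloomFilter()
for word in ["le", "royaume", "de", "france"]:
    lexicon.add(word)

tokens = ["le", "royaume", "de", "frcmce"]  # last token mimics an OCR error
score = known_word_ratio(tokens, lexicon)
print(score)
```

In this toy setup three of the four tokens are in the lexicon, so the ratio reflects one suspected OCR error.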

Model Details

Model Description

Developed by: University of Zurich (UZH), as part of the Impresso team. Impresso is an interdisciplinary project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation (CRSII5_173719, CRSII5_213585) and the Luxembourg National Research Fund (grant No. 17498891).
Model type: Bloom filter–based scoring via a Transformers-compatible pipeline
Languages: French (fr), German (de)
License: GPL-3.0
Base resource: impresso-project/OCR-quality-assessment-unigram
Interface: Hugging Face transformers pipeline
Input format: Raw text string
Output format: Float score representing OCR quality

How to Use

from transformers import pipeline

MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"

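# trust_remote_code=True lets transformers load the custom pipeline code
# from the model repository; inference runs on CPU.
ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
                          trust_remote_code=True,
                          device='cpu')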

sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
          le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""

score = ocrqa_pipeline(sentence)
print(score)

Output Format

Returns a dictionary with a single float value, the proportion of known tokens:

{'ocr_quality_score': 0.76}
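As a sketch of how a downstream tool might consume this output, the helper below flags segments whose score falls under a cut-off. The function name `flag_low_quality`, the segment ids, and the 0.5 threshold are hypothetical, and the sample results are hand-written stand-ins rather than real pipeline output; only the `ocr_quality_score` key comes from the example above.

```python
from typing import Dict, Iterable, List, Tuple

SCORE_KEY = "ocr_quality_score"  # key taken from the sample output above


def flag_low_quality(
    results: Iterable[Tuple[str, Dict[str, float]]],
    threshold: float = 0.5,  # hypothetical cut-off, not a recommended value
) -> List[str]:
    """Return the ids of segments whose score falls below the threshold."""
    return [seg_id for seg_id, result in results if result[SCORE_KEY] < threshold]


# Hand-written stand-ins for pipeline outputs, used only for illustration.
sample_results = [
    ("line-1", {"ocr_quality_score": 0.76}),
    ("line-2", {"ocr_quality_score": 0.31}),
]
flagged = flag_low_quality(sample_results)
print(flagged)
```

A suitable threshold depends on the collection and on how much noise the downstream task tolerates, so it is best calibrated on a manually checked sample.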

Use Cases

OCR pipeline evaluation and quality diagnostics
Automated scoring of OCR segments or lines
Quick feedback in web-based transcription and correction tools

Dataset and Preprocessing

The Bloom filters used internally are derived from:

Wikipedia dumps (historical and modern)
Impresso-specific lexical resources

Text normalization includes:

Unicode NFKC normalization
Digit masking (digits replaced with 0)
Punctuation and symbol removal
Lowercasing
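The normalization steps listed above can be approximated with the Python standard library. This is a sketch under stated assumptions, not the project's actual preprocessing code: `normalize_for_lookup` is a hypothetical name, digit masking is assumed to mean replacing each digit with `0`, and punctuation and symbols are assumed to be replaced by spaces (rather than deleted) so that neighbouring words stay separated.

```python
import unicodedata


def normalize_for_lookup(text: str) -> str:
    """Approximate the listed normalization steps (illustrative only)."""
    # 1. Unicode NFKC normalization.
    text = unicodedata.normalize("NFKC", text)

    chars = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd":
            # 2. Digit masking: assumed to map every decimal digit to "0".
            chars.append("0")
        elif category[0] in ("P", "S"):
            # 3. Punctuation and symbol removal: assumed to leave a space.
            chars.append(" ")
        else:
            chars.append(ch)

    # 4. Lowercasing, then collapse runs of whitespace.
    return " ".join("".join(chars).lower().split())


normalized = normalize_for_lookup("En l'an 1348, la Peste Noire!")
print(normalized)
```

Splitting the returned string on whitespace yields the tokens that would then be looked up in the Bloom filter.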

Limitations

Currently supports only French and German
Does not provide spell correction suggestions
False positives are possible (due to the nature of Bloom filters)
Quality score is approximate and works best at the segment or line level
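To give a sense of scale for the false-positive limitation, the standard textbook approximation for a Bloom filter with m bits, k hash functions, and n inserted items is (1 - e^(-kn/m))^k. The sketch below evaluates it for hypothetical sizes; the actual parameters of the model's filters are not stated in this card.

```python
import math


def bloom_false_positive_rate(num_items: int, num_bits: int, num_hashes: int) -> float:
    """Textbook approximation: (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Hypothetical sizes, chosen only to illustrate the formula.
rate = bloom_false_positive_rate(
    num_items=1_000_000, num_bits=10_000_000, num_hashes=7
)
print(f"{rate:.4%}")
```

A false positive here means a garbled token is counted as a known word, so the quality score can be slightly overestimated but never underestimated by the filter itself.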

Environmental Impact

Hardware: Standard laptop / CPU inference
Training: Reuse of existing Bloom filters; minimal additional compute
Estimated Emissions: < 0.01 kg CO₂eq

Contact

Website: https://impresso-project.ch
