---
library_name: transformers
language:
- fr
- de
license: gpl-3.0
tags:
- ocr
- bloomfilter
- unigram
- impresso
- quality-assessment
- v1.0.6
---

# Model Card for `impresso-project/ocr-quality-assessor-unigram-light`

## Overview

This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), accessible through a Hugging Face `pipeline` for convenient integration into downstream tasks.

It uses **Bloom filters** containing known word unigrams to evaluate text quality: the score is the proportion of words in the OCR output that are found in the filter. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.

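
The known-word ratio described above can be illustrated with a toy Bloom filter. This is a minimal sketch, not the code shipped with the model: the class `TinyBloomFilter`, the function `known_word_ratio`, the filter size, and the SHA-256 based hashing are all assumptions made for illustration.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter; size and hashing are illustrative, not the model's."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        # A word counts as "known" only if every one of its bits is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def known_word_ratio(tokens: list[str], lexicon: TinyBloomFilter) -> float:
    """Return the share of tokens found in the filter (0.0 for empty input)."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


# Hypothetical four-word lexicon and one OCR-damaged token ("frxnce").
lexicon = TinyBloomFilter()
for word in ["le", "royaume", "de", "france"]:
    lexicon.add(word)

tokens = ["le", "royaume", "de", "frxnce"]
print(known_word_ratio(tokens, lexicon))
```

With these toy inputs three of the four tokens are found in the filter, so the ratio reflects one damaged word out of four. A real lexicon holds far more words, which is why a compact probabilistic structure is attractive here.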
## Model Details
### Model Description

- **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
- **Languages:** French (fr), German (de)
- **License:** GPL-3.0
- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
- **Interface:** Hugging Face `transformers` pipeline
- **Input format:** Raw text string
- **Output format:** Dictionary with a single float score (`ocr_quality_score`) representing OCR quality

## How to Use

```python
from transformers import pipeline

MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"

# "ocr-qa-assessment" is a custom task, so trust_remote_code=True is needed
# to load the pipeline code from the model repository.
ocrqa_pipeline = pipeline(
    "ocr-qa-assessment",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""

# Score the text and print the result.
score = ocrqa_pipeline(sentence)
print(score)
```

## Output Format

Returns a dictionary with a single float value, the proportion of known tokens in the input:

```python
{'ocr_quality_score': 0.76}
```

## Use Cases

- OCR pipeline evaluation and quality diagnostics
- Automated scoring of OCR segments or lines
- Quick feedback in web-based transcription and correction tools

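
As one possible way to automate segment-level scoring, the sketch below flags segments whose score falls under a cutoff. It assumes the scorer returns a dictionary with an `ocr_quality_score` key, as in the Output Format section. The helper `flag_low_quality`, the stand-in scorer `fake_score`, and the `0.5` threshold are hypothetical and chosen only so the example runs without downloading the model.

```python
from typing import Callable, Iterable


def flag_low_quality(
    segments: Iterable[str],
    score_fn: Callable[[str], dict],
    threshold: float = 0.5,
) -> list[tuple[str, float]]:
    """Return (segment, score) pairs whose score is below the threshold."""
    flagged = []
    for segment in segments:
        score = score_fn(segment)["ocr_quality_score"]
        if score < threshold:
            flagged.append((segment, score))
    return flagged


def fake_score(text: str) -> dict:
    """Stand-in scorer (hypothetical): share of words in a tiny known set."""
    known = {"la", "peste", "noire"}
    words = text.lower().split()
    ratio = sum(w in known for w in words) / len(words) if words else 0.0
    return {"ocr_quality_score": ratio}


segments = ["la peste noire", "lq pcste n0ire"]
print(flag_low_quality(segments, fake_score))
```

In practice `score_fn` would be the pipeline object itself, and the threshold would need to be tuned on the collection at hand.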
## Dataset and Preprocessing

The Bloom filters used internally are derived from:
- Wikipedia dumps (historical and modern)
- Impresso-specific lexical resources

Text normalization includes:
- Unicode NFKC normalization
- Digit masking (digits replaced by `0`)
- Punctuation and symbol removal
- Lowercasing

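
The normalization steps listed above could look roughly like the following. This is a sketch under assumptions, not the model's actual preprocessing: the function name `normalize_for_lookup`, the order of the steps, replacing punctuation with spaces rather than deleting it, and whitespace tokenization are all illustrative choices.

```python
import re
import unicodedata


def normalize_for_lookup(text: str) -> list[str]:
    """Apply the four listed steps, then split on whitespace (an assumption)."""
    # 1. Unicode NFKC normalization (e.g. ligatures become plain letters).
    text = unicodedata.normalize("NFKC", text)
    # 2. Digit masking: every digit becomes "0".
    text = re.sub(r"\d", "0", text)
    # 3. Replace punctuation (P*) and symbol (S*) characters with spaces.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # 4. Lowercase, then tokenize on whitespace.
    return text.lower().split()


print(normalize_for_lookup("En l'an 1348, la Peste noire!"))
```

Masking digits means that `1348` and `1912` map to the same token (`0000`), so a lexicon does not need an entry for every number.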
## Limitations

- Currently supports only **French** and **German**
- Does not provide spell correction suggestions
- False positives are possible: a Bloom filter can report an unknown word as known (but never a known word as unknown), which can slightly inflate scores
- Quality score is approximate and works best at the **segment** or **line** level

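
To give a feel for how often Bloom filter false positives occur, the standard approximation p ≈ (1 − e^(−kn/m))^k can be evaluated for a given filter size. The sizing below (1 million words, 10 bits per word, 7 hash functions) is hypothetical and does not describe the filters used by this model; `bloom_false_positive_rate` is an illustrative helper.

```python
import math


def bloom_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation: p ~= (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Hypothetical sizing, not the model's real parameters.
rate = bloom_false_positive_rate(
    num_bits=10_000_000, num_hashes=7, num_items=1_000_000
)
print(f"{rate:.2%}")
```

For this hypothetical sizing the rate comes out just under 1%, which illustrates why the score should be read as an approximation rather than an exact count of valid words.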
## Environmental Impact

- **Hardware:** Standard laptop / CPU inference
- **Training:** Reuse of existing Bloom filters; minimal additional compute
- **Estimated Emissions:** < 0.01 kg CO₂eq

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>