---
library_name: transformers
language:
- fr
- de
license: gpl-3.0
tags:
- ocr
- bloomfilter
- unigram
- impresso
- quality-assessment
- v1.0.6
---
# Model Card for `impresso-project/ocr-quality-assessor-unigram-light`
## Overview
This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.
It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.
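The idea of scoring text by the share of known words can be illustrated with a toy Bloom filter. This is a minimal sketch, not the code shipped with the model: the class `TinyBloomFilter`, the helper `known_word_ratio`, the filter size, the hashing scheme, and the four-word lexicon are all assumptions made for illustration.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter; size and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting a SHA-256 hash of the word.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True only if every bit for this word is set (false positives possible).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def known_word_ratio(tokens, bloom):
    """Share of tokens found in the filter; 0.0 for empty input."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in bloom) / len(tokens)


# Hypothetical mini-lexicon standing in for the real unigram resources.
lexicon = TinyBloomFilter()
for word in ["le", "royaume", "de", "france"]:
    lexicon.add(word)

clean_ratio = known_word_ratio(["le", "royaume", "de", "france"], lexicon)
noisy_ratio = known_word_ratio(["le", "rovaumc", "de", "frxnce"], lexicon)
print(clean_ratio, noisy_ratio)
```

A Bloom filter never misses a word that was added, so the clean sequence is always fully recognized, while OCR-damaged tokens lower the ratio unless they happen to collide with stored bits.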
## Model Details
### Model Description
- **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
- **Languages:** French (fr), German (de)
- **License:** GPL-3.0
- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
- **Interface:** Hugging Face `transformers` pipeline
- **Input format:** Raw text string
- **Output format:** Float score representing OCR quality
## How to Use
```python
from transformers import pipeline

MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"

# trust_remote_code=True allows loading the custom pipeline code from the model repository.
ocrqa_pipeline = pipeline(
    "ocr-qa-assessment",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
score = ocrqa_pipeline(sentence)
print(score)
```
## Output Format
The pipeline returns the OCR quality score, a float giving the proportion of known tokens, under the `ocr_quality_score` key:
```python
{'ocr_quality_score': 0.76}
```
## Use Cases
- OCR pipeline evaluation and quality diagnostics
- Automated scoring of OCR segments or lines
- Quick feedback in web-based transcription and correction tools
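Automated scoring of segments typically ends with a decision about which segments need attention. The sketch below flags segments whose score falls under a threshold. The function `flag_low_quality`, the threshold of `0.8`, and the stand-in scorer `toy_score` are hypothetical. In practice, `score_fn` would wrap the pipeline call and extract the float score from its result.

```python
def flag_low_quality(segments, score_fn, threshold=0.8):
    """Return (index, score) pairs for segments scoring below the threshold."""
    flagged = []
    for index, segment in enumerate(segments):
        score = score_fn(segment)
        if score < threshold:
            flagged.append((index, score))
    return flagged


def toy_score(text):
    # Stand-in scorer (an assumption, not the model): share of alphabetic tokens.
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(token.isalpha() for token in tokens) / len(tokens)


segments = ["le royaume de france", "l3 r0yaum3 de fr@nce"]
flagged = flag_low_quality(segments, toy_score)
print(flagged)
```

The right threshold depends on the collection and should be calibrated on a sample of manually checked segments.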
## Dataset and Preprocessing
The Bloom filters used internally are derived from:
- Wikipedia dumps (historical and modern)
- Impresso-specific lexical resources
Text normalization includes:
- Unicode NFKC normalization
- Digit masking (digits replaced by `0`)
- Punctuation and symbol removal
- Lowercasing
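The normalization steps listed above can be sketched with the Python standard library. This is an approximation under stated assumptions, not the project's actual preprocessing code: the function name `normalize_for_lookup`, the choice to replace punctuation with spaces (which splits `l'an` into `l` and `an`), and whitespace tokenization are all illustrative.

```python
import re
import unicodedata


def normalize_for_lookup(text):
    """Normalize text and split it into tokens for lexicon lookup (sketch)."""
    # 1. Unicode NFKC normalization (e.g. ligatures become plain letters).
    text = unicodedata.normalize("NFKC", text)
    # 2. Digit masking: every digit becomes "0".
    text = re.sub(r"\d", "0", text)
    # 3. Punctuation (P*) and symbol (S*) characters become spaces.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # 4. Lowercasing, then whitespace tokenization (an assumption).
    return text.lower().split()


tokens = normalize_for_lookup("En l'an 1348, la Peste!")
print(tokens)
```

Masking digits keeps the lexicon small, since all numbers of the same length map to a single entry.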
## Limitations
- Currently supports only **French** and **German**
- Does not provide spell correction suggestions
- False positives are possible: a Bloom filter can report an unknown word as known, which may slightly inflate the score
- Quality score is approximate and works best at the **segment** or **line** level
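The false positive limitation can be quantified with the standard Bloom filter approximation `p ≈ (1 - e^(-k·n/m))^k`, where `m` is the number of bits, `k` the number of hash functions, and `n` the number of stored items. The sketch below evaluates it. The parameter values are hypothetical and do not describe the filters shipped with this model.

```python
import math


def bloom_false_positive_rate(num_bits, num_hashes, num_items):
    """Approximate false positive probability of a Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Hypothetical sizes, chosen only to show how the rate grows with load.
light_load = bloom_false_positive_rate(num_bits=10_000_000, num_hashes=7, num_items=500_000)
heavy_load = bloom_false_positive_rate(num_bits=10_000_000, num_hashes=7, num_items=2_000_000)
print(light_load, heavy_load)
```

For a fixed filter size, the rate rises as more words are stored, which is why the score should be read as an estimate rather than an exact count.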
## Environmental Impact
- **Hardware:** Standard laptop / CPU inference
- **Training:** Reuse of existing Bloom filters; minimal additional compute
- **Estimated Emissions:** < 0.01 kg CO₂eq
## Contact
- Website: [https://impresso-project.ch](https://impresso-project.ch)
<p align="center">
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>