---
library_name: transformers
language:
- fr
- de
license: gpl-3.0
tags:
- ocr
- bloomfilter
- unigram
- impresso
- quality-assessment
- v1.0.6
---
# Model Card for `impresso-project/ocr-quality-assessor-unigram-light`
## Overview
This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.
It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.
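To make the scoring idea concrete, here is a minimal sketch of a Bloom filter and a known-word ratio. It is an illustration only: `TinyBloomFilter`, `known_word_ratio`, the SHA-256 hashing scheme, and the filter size are all hypothetical choices, not the implementation or parameters shipped with this model.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; NOT the implementation shipped with the model."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, word: str):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos] = 1

    def __contains__(self, word: str) -> bool:
        # True if every position is set: "probably known" (may be a false positive).
        return all(self.bits[pos] for pos in self._positions(word))


def known_word_ratio(tokens, bloom) -> float:
    """Proportion of tokens the filter reports as known (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in bloom) / len(tokens)


bloom = TinyBloomFilter()
for word in ["le", "royaume", "de", "france"]:
    bloom.add(word)

tokens = ["le", "royaume", "de", "frauce"]  # "frauce" simulates an OCR error
ratio = known_word_ratio(tokens, bloom)
print(ratio)
```

Words that were added are always reported as known; a word that was never added is usually, but not always, reported as unknown.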
## Model Details
### Model Description
- **Developed by:** University of Zurich (UZH), part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
- **Languages:** French (fr), German (de)
- **License:** GPL-3.0
- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
- **Interface:** Hugging Face `transformers` pipeline
- **Input format:** Raw text string
- **Output format:** Float score representing OCR quality
## How to Use
```python
from transformers import pipeline
MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"
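# trust_remote_code=True lets Transformers load the custom pipeline code
# from the model repository.
ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
                          trust_remote_code=True,
                          device='cpu')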
sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
score = ocrqa_pipeline(sentence)
print(score)
```
## Output Format
Returns a dictionary with a single key, `ocr_quality_score`, whose float value is the proportion of known tokens:
```python
{'ocr_quality_score': 0.76}
```
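One way to consume this output is to flag segments whose score falls below a cutoff. The sketch below assumes the result has the dictionary shape shown above; `flag_low_quality`, `demo_scorer`, and the `0.5` threshold are hypothetical and chosen only for illustration.

```python
def flag_low_quality(segments, score_fn, threshold=0.5):
    """Return (segment, score) pairs whose score falls below the threshold."""
    flagged = []
    for segment in segments:
        result = score_fn(segment)  # assumed shape: {'ocr_quality_score': float}
        score = result["ocr_quality_score"]
        if score < threshold:
            flagged.append((segment, score))
    return flagged


# Stand-in scorer so the sketch runs without downloading the model;
# in practice, pass the pipeline object from "How to Use" instead.
def demo_scorer(text):
    known = {"le", "royaume", "de", "france"}
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in known)
    return {"ocr_quality_score": hits / len(tokens) if tokens else 0.0}


segments = ["le royaume de france", "lc rnyaume dc frauce"]
flagged = flag_low_quality(segments, demo_scorer)
print(flagged)
```

A suitable threshold depends on the collection and should be calibrated on sample data rather than taken from this example.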
## Use Cases
- OCR pipeline evaluation and quality diagnostics
- Automated scoring of OCR segments or lines
- Quick feedback in web-based transcription and correction tools
## Dataset and Preprocessing
The Bloom filters used internally are derived from:
- Wikipedia dumps (historical and modern)
- Impresso-specific lexical resources
Text normalization includes:
- Unicode NFKC normalization
- Digit masking (digits replaced with `0`)
- Punctuation and symbol removal
- Lowercasing
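The normalization steps listed above could be sketched as follows. This is an approximation under stated assumptions: the function name `normalize`, the order of the steps, and the choice to replace punctuation with a space (rather than delete it) are not confirmed by this model card.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Approximate the listed normalization steps (assumed order and details)."""
    # 1. Unicode NFKC normalization (e.g. ligatures and full-width forms).
    text = unicodedata.normalize("NFKC", text)
    # 2. Digit masking: replace every digit with "0".
    text = re.sub(r"\d", "0", text)
    # 3. Replace punctuation (P*) and symbol (S*) characters with a space
    #    (assumption: the actual pipeline may delete them instead).
    text = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # 4. Lowercasing.
    text = text.lower()
    # Collapse runs of whitespace left behind by step 3.
    return " ".join(text.split())


normalized = normalize("En l'an 1348, la Peste!")
print(normalized)  # en l an 0000 la peste
```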
## Limitations
- Currently supports only **French** and **German**
- Does not provide spell correction suggestions
- False positives are possible: a Bloom filter can occasionally report an unknown word as known (but never the reverse), so scores may be slightly inflated
- Quality score is approximate and works best at the **segment** or **line** level
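As background on the false-positive limitation, the standard approximation for the false-positive probability of a Bloom filter with $m$ bits, $k$ hash functions, and $n$ inserted words is given below. The symbols are generic; the actual parameters of the filters used by this model are not stated here.

$$
p \approx \left(1 - e^{-kn/m}\right)^{k}
$$

Under this approximation, if a fraction $q$ of the tokens in a text is genuinely known, the expected reported score is roughly $q + (1 - q)\,p$, which is why the score can only err on the optimistic side.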
## Environmental Impact
- **Hardware:** Standard laptop / CPU inference
- **Training:** Reuse of existing Bloom filters; minimal additional compute
- **Estimated Emissions:** < 0.01 kg CO₂eq
## Contact
- Website: [https://impresso-project.ch](https://impresso-project.ch)
<p align="center">
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>