Commit 5e2a201
1 Parent(s): 0c32485
modified readme
Files changed:
- README.md +73 -12
- ocr_qa_assessment.py +1 -1
README.md
CHANGED
@@ -1,41 +1,102 @@
|
|
1 |
---
|
2 |
library_name: transformers
|
3 |
language:
|
4 |
-
- en
|
5 |
- fr
|
6 |
- de
|
|
|
7 |
tags:
|
8 |
-
-
|
9 |
---
|
10 |
|
|
|
11 |
|
12 |
-
|
13 |
|
14 |
|
15 |
-
<!-- Provide a longer summary of what this model is. -->
|
16 |
```python
|
17 |
from transformers import pipeline
|
18 |
|
19 |
MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"
|
20 |
|
21 |
ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
|
22 |
-
|
23 |
-
|
24 |
|
25 |
sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
|
26 |
le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
|
27 |
|
28 |
score = ocrqa_pipeline(sentence)
|
29 |
print(score)
|
30 |
-
|
31 |
```
|
32 |
|
33 |
-
|
34 |
|
35 |
```
|
36 |
-
Works with lists of sentences also.
|
37 |
|
38 |
-
|
39 |
|
40 |
-
|
41 |
-
|
1 |
---
|
2 |
library_name: transformers
|
3 |
language:
|
|
|
4 |
- fr
|
5 |
- de
|
6 |
+
license: gpl-3.0
|
7 |
tags:
|
8 |
+
- ocr
|
9 |
+
- bloomfilter
|
10 |
+
- unigram
|
11 |
+
- impresso
|
12 |
+
- quality-assessment
|
13 |
+
- v1.0.6
|
14 |
---
|
15 |
|
16 |
+
# Model Card for impresso-project/ocr-quality-assessor-unigram-light
|
17 |
|
18 |
+
## Overview
|
19 |
|
20 |
+
This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.
|
21 |
+
|
22 |
+
It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.
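The known-versus-unknown proportion described above can be sketched as follows. This is a minimal illustration, not the model's implementation: the function name `known_word_ratio` and the toy lexicon are hypothetical, a plain Python `set` stands in for the Bloom filters, and whether the model counts every token or only distinct ones is not stated in this README.

```python
def known_word_ratio(tokens, known_words):
    """Return the share of tokens found in `known_words` (0.0 for empty input)."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


# Hypothetical toy lexicon; the real model queries Bloom filters, not a set.
lexicon = {"le", "royaume", "de", "france"}
tokens = ["le", "royaume", "de", "frcmce"]  # last token mimics an OCR error

print(known_word_ratio(tokens, lexicon))  # 0.75
```

Three of the four tokens are in the lexicon, so the ratio is 0.75; a heavily garbled segment would score closer to 0.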
|
23 |
+
|
24 |
+
## Model Details
|
25 |
+
|
26 |
+
- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
|
27 |
+
- **Languages:** French (fr), German (de)
|
28 |
+
- **License:** GPL-3.0
|
29 |
+
- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
|
30 |
+
- **Interface:** Transformers `pipeline`
|
31 |
+
- **Input format:** Raw text string
|
32 |
+
- **Output format:** Dictionary with a single float score (OCR quality proxy)
|
33 |
+
- **Developed by:** UZH, Switzerland
|
34 |
+
|
35 |
+
## How to Use
|
36 |
|
|
|
37 |
```python
|
38 |
from transformers import pipeline
|
39 |
|
40 |
MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"
|
41 |
|
42 |
ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
|
43 |
+
trust_remote_code=True,
|
44 |
+
device='cpu')
|
45 |
|
46 |
sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
|
47 |
le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
|
48 |
|
49 |
score = ocrqa_pipeline(sentence)
|
50 |
print(score)
|
|
|
51 |
```
|
52 |
|
53 |
+
## Output Format
|
54 |
|
55 |
+
Returns a dictionary with a single key, `ocr_quality_score`, whose float value is the proportion of known tokens, rounded to four decimal places:
|
56 |
+
|
57 |
+
```python
|
58 |
+
{'ocr_quality_score': 0.76}
|
59 |
```
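Downstream code usually needs to read the score out of that dictionary and act on it. The sketch below shows one way to do so; the helper `is_low_quality` and the threshold of 0.5 are hypothetical choices for illustration, not values recommended by the model authors.

```python
def is_low_quality(result, threshold=0.5):
    """Flag a result dict whose score falls below a hypothetical threshold."""
    return result["ocr_quality_score"] < threshold


result = {"ocr_quality_score": 0.76}  # sample output shown in this README
print(is_low_quality(result))  # False
```

A suitable threshold depends on the collection and language, so it is best calibrated on a sample of manually checked segments.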
|
|
|
60 |
|
61 |
+
## Use Cases
|
62 |
|
63 |
+
- OCR pipeline evaluation and quality diagnostics
|
64 |
+
- Automated scoring of OCR segments or lines
|
65 |
+
- Quick feedback in web-based transcription and correction tools
|
66 |
+
|
67 |
+
## Dataset and Preprocessing
|
68 |
+
|
69 |
+
The Bloom filters used internally are derived from:
|
70 |
+
- Wikipedia dumps (historical and modern)
|
71 |
+
- Impresso-specific lexical resources
|
72 |
+
|
73 |
+
Text normalization includes:
|
74 |
+
- Unicode NFKC normalization
|
75 |
+
- Digit masking (each digit replaced with `0`)
|
76 |
+
- Punctuation and symbol removal
|
77 |
+
- Lowercasing
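The four normalization steps listed above can be approximated with the standard library. This is a sketch under assumptions: the function name `normalize_and_tokenize` is hypothetical, and the order of the steps, the replacement of punctuation by spaces (rather than outright deletion), and the whitespace tokenization are guesses that this README does not confirm.

```python
import unicodedata


def normalize_and_tokenize(text):
    """Apply NFKC, digit masking, punctuation/symbol removal, and lowercasing."""
    text = unicodedata.normalize("NFKC", text)
    chars = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd":             # decimal digit: mask as "0"
            chars.append("0")
        elif category[0] in ("P", "S"):  # punctuation or symbol: assumed to become a space
            chars.append(" ")
        else:
            chars.append(ch)
    # Lowercase, then split on whitespace (assumed tokenization).
    return "".join(chars).lower().split()


print(normalize_and_tokenize("En l'an 1348, la Peste!"))
# ['en', 'l', 'an', '0000', 'la', 'peste']
```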
|
78 |
+
|
79 |
+
## Limitations
|
80 |
+
|
81 |
+
- Currently supports only **French** and **German**
|
82 |
+
- Does not provide spell correction suggestions
|
83 |
+
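- False positives are possible: a Bloom filter can occasionally report an unknown or misrecognized word as known (it never reports a stored word as unknown), which can slightly inflate the score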
|
84 |
+
- Quality score is approximate and works best at the **segment** or **line** level
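To make the false-positive limitation concrete, here is a toy Bloom filter. It is an illustration only: the class `TinyBloomFilter`, its bit-array size, and its SHA-256-based hashing are hypothetical and unrelated to the filters shipped with the model.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter: membership tests may yield false positives, never false negatives."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, word):
        # Derive one bit position per hash by salting the word with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def __contains__(self, word):
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


bloom = TinyBloomFilter()
for word in ["le", "royaume", "de", "france"]:
    bloom.add(word)

print("royaume" in bloom)  # True: stored words are always found
```

Because several words can set the same bits, a word that was never added may still test as present. The smaller the bit array relative to the vocabulary, the more often this happens.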
|
85 |
+
|
86 |
+
## Environmental Impact
|
87 |
+
|
88 |
+
- **Hardware:** Standard laptop / CPU inference
|
89 |
+
- **Training:** Reuse of existing Bloom filters; minimal additional compute
|
90 |
+
- **Estimated Emissions:** < 0.01 kg CO₂eq
|
91 |
+
|
92 |
+
## Citation
|
93 |
+
|
94 |
+
Please cite the Impresso project if using this model in academic or research work.
|
95 |
+
|
96 |
+
## Contact
|
97 |
+
|
98 |
+
- Website: [https://impresso-project.ch](https://impresso-project.ch)
|
99 |
+
|
100 |
+
<p align="center">
|
101 |
+
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
|
102 |
+
</p>
|
ocr_qa_assessment.py
CHANGED
@@ -23,4 +23,4 @@ class QAAssessmentPipeline(Pipeline):
|
|
23 |
|
24 |
# Format as JSON-compatible dictionary
|
25 |
# model_output = {"label": label, "score": round(score, 4)}
|
26 |
-
return {"
|
|
|
23 |
|
24 |
# Format as JSON-compatible dictionary
|
25 |
# model_output = {"label": label, "score": round(score, 4)}
|
26 |
+
return {"ocr_quality_score": round(predictions[0], 4)}
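The effect of the changed return statement can be reproduced in isolation. The sketch below assumes that `predictions` is a sequence whose first element is a float score; the standalone function `format_prediction` is hypothetical and is not part of `ocr_qa_assessment.py`.

```python
def format_prediction(predictions):
    """Wrap the first prediction in the dictionary shape used by the new return line."""
    # Assumption: predictions[0] is a float in [0, 1].
    return {"ocr_quality_score": round(predictions[0], 4)}


print(format_prediction([0.756789]))  # {'ocr_quality_score': 0.7568}
```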
|