emanuelaboros committed on
Commit
5e2a201
·
1 Parent(s): 0c32485

modified readme

Files changed (2)
  1. README.md +73 -12
  2. ocr_qa_assessment.py +1 -1
README.md CHANGED
@@ -1,41 +1,102 @@
1
  ---
2
  library_name: transformers
3
  language:
4
- - en
5
  - fr
6
  - de
 
7
  tags:
8
- - v1.0.0
9
  ---
10
 
 
11
 
12
- #### How to use
13
 
14
 
15
- <!-- Provide a longer summary of what this model is. -->
16
  ```python
17
  from transformers import pipeline
18
 
19
  MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"
20
 
21
  ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
22
- trust_remote_code=True,
23
- device='cpu')
24
 
25
  sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
26
  le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
27
 
28
  score = ocrqa_pipeline(sentence)
29
  print(score)
30
-
31
  ```
32
 
33
- ```
34
 
35
  ```
36
- Works with lists of sentences also.
37
 
38
- ### BibTeX entry and citation info
39
 
40
- ```
41
- ```
1
  ---
2
  library_name: transformers
3
  language:
 
4
  - fr
5
  - de
6
+ license: gpl-3.0
7
  tags:
8
+ - ocr
9
+ - bloomfilter
10
+ - unigram
11
+ - impresso
12
+ - quality-assessment
13
+ - v1.0.6
14
  ---
15
 
16
+ # Model Card for impresso-project/ocr-quality-assessor-unigram-light
17
 
18
+ ## Overview
19
 
20
+ This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.
21
+
22
+ It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.
23
+
24
+ ## Model Details
25
+
26
+ - **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
27
+ - **Languages:** French (fr), German (de)
28
+ - **License:** GPL-3.0
29
+ - **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
30
+ - **Interface:** Transformers `pipeline`
31
+ - **Input format:** Raw text string
32
+ - **Output format:** Dictionary with a float `ocr_quality_score` (OCR quality proxy)
33
+ - **Developed by:** UZH, Switzerland
34
+
35
+ ## How to Use
36
 
 
37
  ```python
38
  from transformers import pipeline
39
 
40
  MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"
41
 
42
  ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
43
+ trust_remote_code=True,
44
+ device='cpu')
45
 
46
  sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
47
  le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
48
 
49
  score = ocrqa_pipeline(sentence)
50
  print(score)
 
51
  ```
52
 
53
+ ## Output Format
54
 
55
+ Returns a dictionary whose `ocr_quality_score` key holds a float between 0 and 1, the proportion of known tokens:
56
+
57
+ ```python
58
+ {'ocr_quality_score': 0.76}
59
  ```
 
60
 
61
+ ## Use Cases
62
 
63
+ - OCR pipeline evaluation and quality diagnostics
64
+ - Automated scoring of OCR segments or lines
65
+ - Quick feedback in web-based transcription and correction tools
66
+
67
+ ## Dataset and Preprocessing
68
+
69
+ The Bloom filters used internally are derived from:
70
+ - Wikipedia dumps (historical and modern)
71
+ - Impresso-specific lexical resources
72
+
73
+ Text normalization includes:
74
+ - Unicode NFKC normalization
75
+ - Digit masking (each digit replaced with `0`)
76
+ - Punctuation and symbol removal
77
+ - Lowercasing
78
+
79
+ ## Limitations
80
+
81
+ - Currently supports only **French** and **German**
82
+ - Does not provide spell correction suggestions
83
+ - False positives are possible: a Bloom filter can occasionally report an unknown word as known, which slightly inflates the score
84
+ - Quality score is approximate and works best at the **segment** or **line** level
85
+
86
+ ## Environmental Impact
87
+
88
+ - **Hardware:** Standard laptop / CPU inference
89
+ - **Training:** Reuse of existing Bloom filters; minimal additional compute
90
+ - **Estimated Emissions:** < 0.01 kg CO₂eq
91
+
92
+ ## Citation
93
+
94
+ Please cite the Impresso project if using this model in academic or research work.
95
+
96
+ ## Contact
97
+
98
+ - Website: [https://impresso-project.ch](https://impresso-project.ch)
99
+
100
+ <p align="center">
101
+ <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
102
+ </p>
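
The normalization steps and the known-word proportion described in the README above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model's actual code: the function names `normalize` and `known_word_ratio` are hypothetical, a plain Python `set` stands in for the Bloom filter, the order of the normalization steps is assumed, and the tiny lexicon is made up for the example.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply the README's normalization steps (the order is an assumption)."""
    text = unicodedata.normalize("NFKC", text).lower()
    chars = []
    for ch in text:
        if ch.isdigit():
            chars.append("0")  # digit masking: every digit becomes 0
        elif ch.isalpha() or ch.isspace():
            chars.append(ch)
        else:
            chars.append(" ")  # punctuation and symbols act as separators
    return "".join(chars)


def known_word_ratio(text: str, lexicon: set) -> float:
    """Proportion of normalized tokens found in the lexicon."""
    tokens = normalize(text).split()
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in lexicon)
    return known / len(tokens)


# Hypothetical lexicon; a set stands in for the Bloom filter here.
LEXICON = {"en", "l", "an", "0000", "la", "peste", "noire"}

# "pcste" simulates an OCR error for "peste".
score = known_word_ratio("En l'an 1348, la pcste noire", LEXICON)
print(round(score, 4))  # 0.8571 (6 of 7 tokens are known)
```

A real Bloom filter would replace the `set` membership test; unlike a set, it can return occasional false positives, so the score would be an upper-leaning estimate rather than an exact proportion.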
ocr_qa_assessment.py CHANGED
@@ -23,4 +23,4 @@ class QAAssessmentPipeline(Pipeline):
23
 
24
  # Format as JSON-compatible dictionary
25
  # model_output = {"label": label, "score": round(score, 4)}
26
- return {"score": round(predictions[0], 4)}
 
23
 
24
  # Format as JSON-compatible dictionary
25
  # model_output = {"label": label, "score": round(score, 4)}
26
+ return {"ocr_quality_score": round(predictions[0], 4)}
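
Because this change renames the returned key from `score` to `ocr_quality_score`, callers written against the earlier revision would fail with a `KeyError`. One possible way to stay compatible with both revisions is a small accessor. The helper name `extract_ocr_score` is hypothetical and not part of the repository; the sketch assumes the pipeline output is a plain dictionary, as in the diff above.

```python
def extract_ocr_score(result: dict) -> float:
    """Return the OCR quality score from either the new or the old output key."""
    for key in ("ocr_quality_score", "score"):
        if key in result:
            return float(result[key])
    raise KeyError("no OCR quality score found in pipeline output")


# Works with the output of the new revision and of the previous one.
print(extract_ocr_score({"ocr_quality_score": 0.76}))  # 0.76
print(extract_ocr_score({"score": 0.76}))  # 0.76
```

Checking the new key first means the helper prefers the current format if both keys are ever present.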