impresso-project/ocr-quality-assessor-unigram-light

@@ -1,41 +1,102 @@
 ---
 library_name: transformers
 language:
-- en
 - fr
 - de
 tags:
-- v1.0.0
 ---
-#### How to use
-<!-- Provide a longer summary of what this model is. -->
 ```python
 from transformers import pipeline
 MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"
 ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
-                        trust_remote_code=True,
-                        device='cpu')
 sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
           le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
 score = ocrqa_pipeline(sentence)
 print(score)
 ```
-```
 ```
-Works with lists of sentences also.
-### BibTeX entry and citation info
-```
-```

 ---
 library_name: transformers
 language:
 - fr
 - de
+license: gpl-3.0
 tags:
+- ocr
+- bloomfilter
+- unigram
+- impresso
+- quality-assessment
+- v1.0.6
 ---
+# Model Card for impresso-project/ocr-quality-assessor-unigram-light
+## Overview
+This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.
+It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.
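To illustrate the idea of scoring text by the proportion of known words, here is a minimal sketch of a Bloom filter lookup. It is a toy stand-in, not the filter shipped with this model: the class `TinyBloomFilter`, the helper `known_word_ratio`, the bit-array size, the SHA-256 hashing scheme, and the choice to return `0.0` for empty input are all assumptions made for the example.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter for illustration; NOT the filter used by the model."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, word: str):
        # Derive num_hashes positions by salting a SHA-256 digest with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos] = 1

    def __contains__(self, word: str) -> bool:
        # True for every added word; may rarely be True for an unseen word.
        return all(self.bits[pos] for pos in self._positions(word))


def known_word_ratio(tokens, bloom) -> float:
    """Proportion of tokens that the filter reports as known."""
    if not tokens:
        return 0.0  # assumption: empty input scores 0.0
    return sum(token in bloom for token in tokens) / len(tokens)


lexicon = TinyBloomFilter()
for word in ["le", "royaume", "de", "france"]:
    lexicon.add(word)

tokens = ["le", "royaume", "de", "frcmce"]  # last token mimics an OCR error
ratio = known_word_ratio(tokens, lexicon)
print(ratio)
```

A Bloom filter never misses a word that was added, so the three correct tokens always count as known; the garbled token is rejected unless it happens to collide with bits that are already set.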
+## Model Details
+- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
+- **Languages:** French (fr), German (de)
+- **License:** GPL-3.0
+- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
+- **Interface:** Transformers `pipeline`
+- **Input format:** Raw text string
+- **Output format:** Dictionary with a float `ocr_quality_score` (OCR quality proxy)
+- **Developed by:** UZH, Switzerland
+## How to Use
 ```python
 from transformers import pipeline
 MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"
 ocrqa_pipeline = pipeline("ocr-qa-assessment", model=MODEL_NAME,
+                          trust_remote_code=True,
+                          device='cpu')
 sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
           le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
 score = ocrqa_pipeline(sentence)
 print(score)
 ```
+## Output Format
+Returns a dictionary with a single key, `ocr_quality_score`, whose float value is the proportion of known tokens:
+```python
+{'ocr_quality_score': 0.76}
 ```
+## Use Cases
+- OCR pipeline evaluation and quality diagnostics
+- Automated scoring of OCR segments or lines
+- Quick feedback in web-based transcription and correction tools
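As a rough sketch of the segment-scoring use case, the helper below flags segments whose score falls under a threshold. The function name `flag_low_quality`, the threshold of `0.5`, and the offline stand-in `fake_assess` are hypothetical; in practice the `assess` argument would be the `ocrqa_pipeline` object from the usage example, assuming it returns a dictionary with the `ocr_quality_score` key.

```python
from typing import Callable, Dict, List, Tuple


def flag_low_quality(
    segments: List[str],
    assess: Callable[[str], Dict[str, float]],
    threshold: float = 0.5,  # arbitrary cut-off chosen for the example
) -> List[Tuple[str, float]]:
    """Return (segment, score) pairs whose score is below the threshold."""
    flagged = []
    for segment in segments:
        score = assess(segment)["ocr_quality_score"]
        if score < threshold:
            flagged.append((segment, score))
    return flagged


def fake_assess(text: str) -> Dict[str, float]:
    """Offline stand-in for the real pipeline, using a plain set as lexicon."""
    known = {"le", "royaume", "de", "france"}
    tokens = text.lower().split()
    ratio = sum(t in known for t in tokens) / len(tokens) if tokens else 0.0
    return {"ocr_quality_score": round(ratio, 4)}


segments = ["le royaume de france", "lc roynume dc frcmce"]
flagged = flag_low_quality(segments, fake_assess, threshold=0.5)
print(flagged)
```

A suitable threshold depends on the collection and should be calibrated on samples with known OCR quality.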
+## Dataset and Preprocessing
+The Bloom filters used internally are derived from:
+- Wikipedia dumps (historical and modern)
+- Impresso-specific lexical resources
+Text normalization includes:
+- Unicode NFKC normalization
+- Digit masking (digits replaced with `0`)
+- Punctuation and symbol removal
+- Lowercasing
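The normalization steps listed above can be sketched as follows. This is an approximation under stated assumptions, not the project's actual code: the function name `normalize` is hypothetical, punctuation and symbols are replaced by spaces rather than deleted, and every non-alphanumeric character is treated as punctuation.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate the listed steps: NFKC, digit masking, punctuation removal, lowercasing."""
    text = unicodedata.normalize("NFKC", text)
    chars = []
    for ch in text:
        if ch.isdigit():
            chars.append("0")  # mask every digit with 0
        elif ch.isalpha():
            chars.append(ch)
        else:
            chars.append(" ")  # assumption: punctuation and symbols become spaces
    # Lowercase and collapse runs of whitespace.
    return " ".join("".join(chars).lower().split())


print(normalize("En l'an 1348, au plus fort"))  # en l an 0000 au plus fort
```

Masking digits means that any four-digit year maps to the same token (`0000`), so a lexicon needs only one entry per digit pattern.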
+## Limitations
+- Currently supports only **French** and **German**
+- Does not provide spell correction suggestions
+- False positives are possible: a Bloom filter can report an unseen word as known, which may slightly inflate the score
+- Quality score is approximate and works best at the **segment** or **line** level
+## Environmental Impact
+- **Hardware:** Standard laptop / CPU inference
+- **Training:** Reuse of existing Bloom filters; minimal additional compute
+- **Estimated Emissions:** < 0.01 kg CO₂eq
+## Citation
+Please cite the Impresso project if using this model in academic or research work.
+## Contact
+- Website: [https://impresso-project.ch](https://impresso-project.ch)
+<p align="center">
+  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+</p>

ocr_qa_assessment.py CHANGED

@@ -23,4 +23,4 @@ class QAAssessmentPipeline(Pipeline):
         # Format as JSON-compatible dictionary
         # model_output = {"label": label, "score": round(score, 4)}
-        return {"score": round(predictions[0], 4)}

         # Format as JSON-compatible dictionary
         # model_output = {"label": label, "score": round(score, 4)}
+        return {"ocr_quality_score": round(predictions[0], 4)}
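Because this change renames the returned key from `score` to `ocr_quality_score`, downstream code that reads the old key would break. A small compatibility helper along the following lines could bridge both versions; the name `extract_quality_score` is hypothetical and the helper is only a sketch.

```python
def extract_quality_score(result: dict) -> float:
    """Read the quality score from either the new or the old output format."""
    for key in ("ocr_quality_score", "score"):  # new key first, then legacy key
        if key in result:
            return float(result[key])
    raise KeyError("no 'ocr_quality_score' or 'score' key in pipeline output")


print(extract_quality_score({"ocr_quality_score": 0.76}))  # 0.76
print(extract_quality_score({"score": 0.5}))               # 0.5
```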