TrOCR

Overview

The TrOCR model was proposed in "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. TrOCR consists of an image Transformer encoder and an autoregressive text Transformer decoder that together perform optical character recognition (OCR).

The abstract from the paper is the following:

Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.

Figure: TrOCR architecture. Taken from the original paper.

Please refer to the [VisionEncoderDecoder] class on how to use this model.

This model was contributed by nielsr. The original code can be found in the microsoft/unilm repository on GitHub.

Usage tips

- The quickest way to get started with TrOCR is to check the tutorial notebooks, which show how to use the model at inference time as well as how to fine-tune it on custom data.
- TrOCR is pre-trained in 2 stages before being fine-tuned on downstream datasets. The paper reports state-of-the-art results on both printed (e.g. the SROIE dataset) and handwritten (e.g. the IAM Handwriting dataset) text recognition tasks. For more information, see the official models.
- TrOCR is always used within the VisionEncoderDecoder framework.

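To make the last tip concrete, the sketch below assembles a randomly initialized vision encoder-decoder model from a ViT encoder configuration and a TrOCR decoder configuration, then runs a single forward pass. All sizes are deliberately tiny, illustrative assumptions rather than the settings of any released checkpoint, and ViT is only one possible encoder choice. With random weights the predictions are meaningless, so in practice you would load a pre-trained checkpoint instead.

```python
import torch
from transformers import (
    TrOCRConfig,
    ViTConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Tiny, illustrative sizes (assumptions, not the values of a released checkpoint).
encoder_config = ViTConfig(
    image_size=32,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
decoder_config = TrOCRConfig(
    vocab_size=100,
    d_model=32,
    decoder_layers=1,
    decoder_attention_heads=2,
    decoder_ffn_dim=64,
    max_position_embeddings=64,
)

# Combine both configurations; the decoder is set up with cross-attention
# so that it can attend to the encoder's image features.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = VisionEncoderDecoderModel(config=config)
model.eval()

# One random "image" and a short, arbitrary decoder prefix.
pixel_values = torch.rand(1, 3, 32, 32)
decoder_input_ids = torch.tensor([[0, 5, 6]])

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# (batch size, decoder sequence length, vocabulary size)
logits_shape = tuple(outputs.logits.shape)
print(type(model.encoder).__name__, type(model.decoder).__name__, logits_shape)
```

The image encoder and the text decoder are exposed as `model.encoder` and `model.decoder`. The logits contain one row per decoder input token and one column per vocabulary entry.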
Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with TrOCR. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

- A blog post on Accelerating Document AI with TrOCR.
- A blog post on how to do Document AI with TrOCR.
- A notebook on how to fine-tune TrOCR on the IAM Handwriting Database using Seq2SeqTrainer.
- A notebook on inference with TrOCR and a Gradio demo.
- A notebook on how to fine-tune TrOCR on the IAM Handwriting Database using native PyTorch.
- A notebook on evaluating TrOCR on the IAM test set.
- Causal language modeling task guide.

⚡️ Inference

- An interactive demo on TrOCR handwritten character recognition.

Inference

TrOCR's [VisionEncoderDecoder] model accepts images as input and makes use of [~generation.GenerationMixin.generate] to autoregressively generate text given the input image.

The [ViTImageProcessor/DeiTImageProcessor] class is responsible for preprocessing the input image and [RobertaTokenizer/XLMRobertaTokenizer] decodes the generated target tokens to the target string. The [TrOCRProcessor] wraps [ViTImageProcessor/DeiTImageProcessor] and [RobertaTokenizer/XLMRobertaTokenizer] into a single instance to both extract the input features and decode the predicted token ids.

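As a rough illustration of what "autoregressively generate" means, the following toy sketch implements greedy decoding in plain Python. It is not the library's implementation: `greedy_decode`, `toy_scores`, and the four-entry vocabulary are hypothetical stand-ins. A real decoder would also condition its scores on the encoded image, and `generate` supports other strategies such as beam search.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[Sequence[int]], List[float]],
    start_token_id: int,
    eos_token_id: int,
    max_new_tokens: int = 20,
) -> List[int]:
    """Append the highest-scoring next token until EOS or the length limit."""
    token_ids = [start_token_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores(token_ids)
        # Greedy choice: index of the largest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        token_ids.append(next_id)
        if next_id == eos_token_id:
            break
    return token_ids


# Hypothetical vocabulary and "model": it always wants to spell "hi" and then stop.
TOY_VOCAB = ["<s>", "</s>", "h", "i"]
TARGET = [2, 3, 1]  # "h", "i", "</s>"


def toy_scores(prefix: Sequence[int]) -> List[float]:
    step = len(prefix) - 1  # number of tokens generated so far
    wanted = TARGET[min(step, len(TARGET) - 1)]
    return [1.0 if i == wanted else 0.0 for i in range(len(TOY_VOCAB))]


generated = greedy_decode(toy_scores, start_token_id=0, eos_token_id=1)
# Drop the special tokens before joining, similar in spirit to skip_special_tokens=True.
text = "".join(TOY_VOCAB[i] for i in generated if i not in (0, 1))
print(generated, text)
```

Each step feeds the tokens produced so far back into the scoring function, which is the defining property of autoregressive generation.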
Step-by-step Optical Character Recognition (OCR):

``` py
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
import requests
from PIL import Image

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# load image from the IAM dataset
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# preprocess the image into model-ready pixel values
pixel_values = processor(image, return_tensors="pt").pixel_values

# generate token ids autoregressively, then decode them into text
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

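Once a transcription such as `generated_text` is available, one common way to score it against a ground-truth string is the character error rate (CER): the edit distance between the two strings divided by the reference length. The sketch below is a minimal pure-Python version. The example strings are made up for illustration, and an evaluation notebook may compute the metric differently (for example, through a metrics library).

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions that turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical reference and prediction (not real model output).
cer = character_error_rate("industry", "indus try")
print(cer)
```

Lower is better: a CER of 0 means the prediction matches the reference exactly, and values above 1 are possible when the prediction is much longer than the reference.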
See the model hub to look for TrOCR checkpoints.

TrOCRConfig

[[autodoc]] TrOCRConfig

TrOCRProcessor

[[autodoc]] TrOCRProcessor
- __call__
- from_pretrained
- save_pretrained
- batch_decode
- decode

TrOCRForCausalLM

[[autodoc]] TrOCRForCausalLM
- forward