|
|
|
# LayoutLM
|
|
|
## Overview
|
The LayoutLM model was proposed in the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results on several downstream tasks:
|
|
|
- form understanding: the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset (a collection of 199 annotated forms comprising more than 30,000 words).
- receipt understanding: the [SROIE](https://rrc.cvc.uab.es/?ch=13) dataset (a collection of 626 receipts for training and 347 receipts for testing).
- document image classification: the [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/) dataset (a collection of 400,000 images belonging to one of 16 classes).
|
|
|
The abstract from the paper is the following:

*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42).*
|
## Usage tips
|
|
|
- In addition to *input_ids*, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which are the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python wrapper](https://pypi.org/project/pytesseract/) available). Each bounding box should be in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner of the bounding box and (x1, y1) represents the position of the lower right corner. Note that the bounding boxes first need to be normalized to a 0-1000 scale. To normalize, you can use the following function:
|
|
|
```python
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]
```
|
Here, `width` and `height` correspond to the width and height of the original document in which the token occurs. Those can be obtained using the Python Imaging Library (PIL), for example, as follows:
|
```python
from PIL import Image

# Document can be a png, jpg, etc. PDFs must be converted to images.
image = Image.open(name_of_your_document).convert("RGB")
width, height = image.size
```
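
Putting both steps together, here is a minimal sketch of obtaining words and normalized boxes with the `pytesseract` wrapper mentioned above (the file name `form.png` is a placeholder, and `normalize_bbox` is the helper defined earlier):

```python
import pytesseract
from PIL import Image

image = Image.open("form.png").convert("RGB")  # placeholder file name
width, height = image.size

# image_to_data returns parallel lists with, per recognized word,
# its text and its pixel-level bounding box (left, top, width, height)
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

words, boxes = [], []
for text, left, top, w, h in zip(
    ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
):
    if text.strip():  # skip entries for empty/whitespace detections
        words.append(text)
        boxes.append(normalize_bbox([left, top, left + w, top + h], width, height))
```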
|
|
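The words and normalized boxes can then be fed to the model. Below is a minimal sketch of a forward pass (the words and boxes are hard-coded stand-ins for real OCR output). Since the tokenizer may split a word into several subword tokens, each word-level box has to be repeated for every token it produces, and dummy boxes are added for the special [CLS] and [SEP] tokens:

```python
import torch
from transformers import LayoutLMModel, LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Hello", "world"]
normalized_word_boxes = [[637, 773, 693, 782], [698, 773, 733, 782]]  # stand-in values

# repeat each word-level box once per subword token produced by the tokenizer
token_boxes = []
for word, box in zip(words, normalized_word_boxes):
    word_tokens = tokenizer.tokenize(word)
    token_boxes.extend([box] * len(word_tokens))
# add dummy boxes for the special [CLS] and [SEP] tokens
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

encoding = tokenizer(" ".join(words), return_tensors="pt")
bbox = torch.tensor([token_boxes])

outputs = model(
    input_ids=encoding["input_ids"],
    attention_mask=encoding["attention_mask"],
    token_type_ids=encoding["token_type_ids"],
    bbox=bbox,
)
last_hidden_state = outputs.last_hidden_state  # shape (1, sequence_length, hidden_size)
```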
|
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLM. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
|
|
- A blog post on fine-tuning LayoutLM for document understanding using Keras & Hugging Face Transformers.
- A blog post on how to fine-tune LayoutLM for document understanding using only Hugging Face Transformers.
- A notebook on how to fine-tune LayoutLM on the FUNSD dataset with image embeddings.
- See also: [Document question answering task guide](../tasks/document_question_answering)
- A notebook on how to fine-tune LayoutLM for sequence classification on the RVL-CDIP dataset.
- [Text classification task guide](../tasks/sequence_classification)
- A notebook on how to fine-tune LayoutLM for token classification on the FUNSD dataset.
- [Token classification task guide](../tasks/token_classification)

**Other resources**

- [Masked language modeling task guide](../tasks/masked_language_modeling)

**🚀 Deploy**

- A blog post on how to deploy LayoutLM with Hugging Face Inference Endpoints.
|
|
|
## LayoutLMConfig

[[autodoc]] LayoutLMConfig

## LayoutLMTokenizer

[[autodoc]] LayoutLMTokenizer

## LayoutLMTokenizerFast

[[autodoc]] LayoutLMTokenizerFast
|
|
|
## LayoutLMModel

[[autodoc]] LayoutLMModel

## LayoutLMForMaskedLM

[[autodoc]] LayoutLMForMaskedLM

## LayoutLMForSequenceClassification

[[autodoc]] LayoutLMForSequenceClassification

## LayoutLMForTokenClassification

[[autodoc]] LayoutLMForTokenClassification

## LayoutLMForQuestionAnswering

[[autodoc]] LayoutLMForQuestionAnswering
|
|
|
## TFLayoutLMModel

[[autodoc]] TFLayoutLMModel

## TFLayoutLMForMaskedLM

[[autodoc]] TFLayoutLMForMaskedLM

## TFLayoutLMForSequenceClassification

[[autodoc]] TFLayoutLMForSequenceClassification

## TFLayoutLMForTokenClassification

[[autodoc]] TFLayoutLMForTokenClassification

## TFLayoutLMForQuestionAnswering

[[autodoc]] TFLayoutLMForQuestionAnswering
|
|
|
|