|
|
|
LayoutLMV2 |
|
Overview |
|
The LayoutLMV2 model was proposed in LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, |
|
Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves LayoutLM to obtain |
|
state-of-the-art results across several document image understanding benchmarks: |
|
|
|
information extraction from scanned documents: the FUNSD dataset (a |
|
collection of 199 annotated forms comprising more than 30,000 words), the CORD |
|
dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the SROIE dataset (a collection of 626 receipts for training and 347 receipts for testing) |
|
and the Kleister-NDA dataset (a collection of non-disclosure |
|
agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203 |
|
documents for testing). |
|
document image classification: the RVL-CDIP dataset (a collection of |
|
400,000 images belonging to one of 16 classes). |
|
document visual question answering: the DocVQA dataset (a collection of 50,000 |
|
questions defined on 12,000+ document images). |
|
|
|
The abstract from the paper is the following: |
|
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to |
|
its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this |
|
paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model |
|
architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked |
|
visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training |
|
stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention |
|
mechanism into the Transformer architecture, so that the model can fully understand the relative positional |
|
relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and |
|
achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, |
|
including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), |
|
RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained LayoutLMv2 model is publicly available at |
|
this https URL. |
|
LayoutLMv2 depends on detectron2, torchvision and tesseract. Run the |
|
following to install them: |
|
|
|
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git' |
|
python -m pip install torchvision tesseract |
|
(If you are developing for LayoutLMv2, note that passing the doctests also requires the installation of these packages.) |
|
Usage tips |
|
|
|
The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during |
|
pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning). |
|
LayoutLMv2 adds both a relative 1D attention bias and a spatial 2D attention bias to the attention scores in
|
the self-attention layers. Details can be found on page 5 of the paper. |
|
Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found here. |
|
LayoutLMv2 uses Facebook AI's Detectron2 package for its visual |
|
backbone. See this link for installation |
|
instructions. |
|
In addition to input_ids, [~LayoutLMv2Model.forward] expects 2 additional inputs, namely |
|
image and bbox. The image input corresponds to the original document image in which the text |
|
tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of |
|
document images, image should be a tensor of shape (batch_size, 3, 224, 224). This can be either a |
|
torch.Tensor or a detectron2.structures.ImageList. You don't need to normalize the channels, as this is

done by the model. Note that the visual backbone expects BGR channels instead of RGB, as all models

in Detectron2 are pre-trained using the BGR format. The bbox input contains the bounding boxes (i.e. 2D positions)
|
of the input text tokens. This is identical to [LayoutLMModel]. These can be obtained using an |
|
external OCR engine such as Google's Tesseract (there's a Python |
|
wrapper available). Each bounding box should be in (x0, y0, x1, y1) |
|
format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) |
|
represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on |
|
a 0-1000 scale. To normalize, you can use the following function: |
|
|
|
python |
|
def normalize_bbox(bbox, width, height): |
|
return [ |
|
int(1000 * (bbox[0] / width)), |
|
int(1000 * (bbox[1] / height)), |
|
int(1000 * (bbox[2] / width)), |
|
int(1000 * (bbox[3] / height)), |
|
] |
|
Here, width and height correspond to the width and height of the original document in which the token |
|
occurs (before resizing the image). Those can be obtained using the Python Imaging Library (PIL), for example, as
|
follows: |
|
python
|
from PIL import Image |
|
image = Image.open( |
|
"name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)." |
|
) |
|
width, height = image.size |
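
To tie the two snippets above together, here is a small sketch (the pixel coordinates are hypothetical, not from the original documentation) showing how a single word's bounding box would be normalized with the normalize_bbox function defined earlier:

python

# hypothetical pixel coordinates (x0, y0, x1, y1) of one word, as returned by your OCR engine
bbox = [120, 50, 260, 80]
# width and height come from the PIL snippet above
normalized_bbox = normalize_bbox(bbox, width, height)  # values are now on the 0-1000 scale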
|
|
|
However, this model includes a brand new [~transformers.LayoutLMv2Processor] which can be used to directly |
|
prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage" |
|
section below. |
|
|
|
Internally, [~transformers.LayoutLMv2Model] will send the image input through its visual backbone to |
|
obtain a lower-resolution feature map, whose shape is equal to the image_feature_pool_shape attribute of |
|
[~transformers.LayoutLMv2Config]. This feature map is then flattened to obtain a sequence of image tokens. As |
|
the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text |
|
tokens, and sent through the Transformer encoder. This means that the last hidden states of the model will have a

sequence length of 512 + 49 = 561 if you pad the text tokens up to the max length. More generally, the last hidden states

will have a sequence length of seq_length + config.image_feature_pool_shape[0] *

config.image_feature_pool_shape[1].
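
As a quick sanity check (a sketch that is not part of the original documentation), the number of image tokens can be computed from a default [~transformers.LayoutLMv2Config]:

python

from transformers import LayoutLMv2Config

config = LayoutLMv2Config()  # default configuration
# the visual feature map is pooled to image_feature_pool_shape[0] x image_feature_pool_shape[1]
num_image_tokens = config.image_feature_pool_shape[0] * config.image_feature_pool_shape[1]
print(num_image_tokens)  # 49 with the default 7x7 pooled feature map
# with text tokens padded to 512, last_hidden_state has sequence length 512 + 49 = 561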
|
When calling [~transformers.LayoutLMv2Model.from_pretrained], a warning will be printed with a long list of |
|
parameter names that are not initialized. This is not a problem, as these parameters are batch normalization |
|
statistics, which will be populated when fine-tuning on a custom dataset.
|
If you want to train the model in a distributed environment, make sure to call [synchronize_batch_norm] on the |
|
model in order to properly synchronize the batch normalization layers of the visual backbone. |
|
|
|
In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on |
|
LayoutXLM's documentation page. |
|
Resources |
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. |
|
|
|
A notebook on how to finetune LayoutLMv2 for text classification on the RVL-CDIP dataset.
|
See also: Text classification task guide |
|
|
|
A notebook on how to finetune LayoutLMv2 for question answering on the DocVQA dataset.
|
See also: Question answering task guide |
|
See also: Document question answering task guide |
|
|
|
A notebook on how to finetune LayoutLMv2 for token classification on the CORD dataset.

A notebook on how to finetune LayoutLMv2 for token classification on the FUNSD dataset.
|
See also: Token classification task guide |
|
|
|
Usage: LayoutLMv2Processor |
|
The easiest way to prepare data for the model is to use [LayoutLMv2Processor], which internally |
|
combines an image processor ([LayoutLMv2ImageProcessor]) and a tokenizer
|
([LayoutLMv2Tokenizer] or [LayoutLMv2TokenizerFast]). The image processor |
|
handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal |
|
for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one |
|
modality. |
|
python
|
from transformers import LayoutLMv2ImageProcessor, LayoutLMv2TokenizerFast, LayoutLMv2Processor |
|
image_processor = LayoutLMv2ImageProcessor() # apply_ocr is set to True by default |
|
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased") |
|
processor = LayoutLMv2Processor(image_processor, tokenizer) |
|
|
|
In short, one can provide a document image (and possibly additional data) to [LayoutLMv2Processor], |
|
and it will create the inputs expected by the model. Internally, the processor first uses |
|
[LayoutLMv2ImageProcessor] to apply OCR on the image to get a list of words and normalized |
|
bounding boxes, as well as to resize the image to a given size in order to get the image input. The words and
|
normalized bounding boxes are then provided to [LayoutLMv2Tokenizer] or |
|
[LayoutLMv2TokenizerFast], which converts them to token-level input_ids, |
|
attention_mask, token_type_ids, bbox. Optionally, one can provide word labels to the processor, |
|
which are turned into token-level labels. |
|
[LayoutLMv2Processor] uses PyTesseract, a Python |
|
wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of |
|
choice, and provide the words and normalized boxes yourself. This requires initializing |
|
[LayoutLMv2ImageProcessor] with apply_ocr set to False. |
|
In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these |
|
use cases works for both batched and non-batched inputs (we illustrate them for non-batched inputs).
|
Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr = |
|
True |
|
This is the simplest case, in which the processor (actually the image processor) will perform OCR on the image to get |
|
the words and normalized bounding boxes. |
|
python
|
from transformers import LayoutLMv2Processor |
|
from PIL import Image |
|
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased") |
|
image = Image.open( |
|
"name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)." |
|
).convert("RGB") |
|
encoding = processor( |
|
image, return_tensors="pt" |
|
) # you can also add all tokenizer parameters here such as padding, truncation |
|
print(encoding.keys()) |
|
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
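
As an illustration (a sketch, not part of the original documentation), the encoding above can be passed directly to [LayoutLMv2ForSequenceClassification] for document image classification. Note that the classification head of the base checkpoint is randomly initialized, so predictions are only meaningful after fine-tuning:

python

import torch
from transformers import LayoutLMv2ForSequenceClassification

model = LayoutLMv2ForSequenceClassification.from_pretrained("microsoft/layoutlmv2-base-uncased")
with torch.no_grad():
    outputs = model(**encoding)  # encoding from the snippet above
predicted_class_idx = outputs.logits.argmax(-1).item()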
|
|
|
Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False |
|
In case one wants to do OCR themselves, one can initialize the image processor with apply_ocr set to |
|
False. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to |
|
the processor. |
|
python
|
from transformers import LayoutLMv2Processor |
|
from PIL import Image |
|
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr") |
|
image = Image.open( |
|
"name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)." |
|
).convert("RGB") |
|
words = ["hello", "world"] |
|
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes |
|
encoding = processor(image, words, boxes=boxes, return_tensors="pt") |
|
print(encoding.keys()) |
|
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
|
|
Use case 3: token classification (training), apply_ocr=False |
|
For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word |
|
labels in order to train a model. The processor will then convert these into token-level labels. By default, it |
|
will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the |
|
ignore_index of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can |
|
initialize the tokenizer with only_label_first_subword set to False. |
|
python
|
from transformers import LayoutLMv2Processor |
|
from PIL import Image |
|
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr") |
|
image = Image.open( |
|
"name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)." |
|
).convert("RGB") |
|
words = ["hello", "world"] |
|
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes |
|
word_labels = [1, 2] |
|
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt") |
|
print(encoding.keys()) |
|
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
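
To illustrate (a sketch, not taken from the original documentation; num_labels=3 is an arbitrary choice), the encoding above already contains token-level labels, so it can be fed to [LayoutLMv2ForTokenClassification] to compute a training loss:

python

from transformers import LayoutLMv2ForTokenClassification

model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=3  # arbitrary label count for illustration
)
outputs = model(**encoding)  # "labels" is part of the encoding, so a loss is returned
loss = outputs.loss
loss.backward()  # plug this into your usual PyTorch training loop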
|
|
|
Use case 4: visual question answering (inference), apply_ocr=True |
|
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the |
|
processor will apply OCR on the image, and create [CLS] question tokens [SEP] word tokens [SEP]. |
|
python
|
from transformers import LayoutLMv2Processor |
|
from PIL import Image |
|
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased") |
|
image = Image.open( |
|
"name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)." |
|
).convert("RGB") |
|
question = "What's his name?" |
|
encoding = processor(image, question, return_tensors="pt") |
|
print(encoding.keys()) |
|
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
|
|
Use case 5: visual question answering (inference), apply_ocr=False |
|
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to |
|
perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor. |
|
python
|
from transformers import LayoutLMv2Processor |
|
from PIL import Image |
|
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr") |
|
image = Image.open( |
|
"name_of_your_document - can be a png, jpg, etc. of your documents (PDFs must be converted to images)." |
|
).convert("RGB") |
|
question = "What's his name?" |
|
words = ["hello", "world"] |
|
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes |
|
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt") |
|
print(encoding.keys()) |
|
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
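
As a final illustration (a sketch, not part of the original documentation), the encoding produced in use case 4 or 5 can be passed to [LayoutLMv2ForQuestionAnswering] to extract an answer span. The question answering head of the base checkpoint is randomly initialized, so in practice you would load a checkpoint fine-tuned on DocVQA:

python

import torch
from transformers import LayoutLMv2ForQuestionAnswering

model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")
with torch.no_grad():
    outputs = model(**encoding)  # encoding from the snippet above

# take the most likely start and end positions and decode the answer span
start_idx = outputs.start_logits.argmax(-1).item()
end_idx = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding["input_ids"][0][start_idx : end_idx + 1])
print(answer)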
|
|
|
LayoutLMv2Config |
|
[[autodoc]] LayoutLMv2Config |
|
LayoutLMv2FeatureExtractor |
|
[[autodoc]] LayoutLMv2FeatureExtractor |
|
- __call__
|
LayoutLMv2ImageProcessor |
|
[[autodoc]] LayoutLMv2ImageProcessor |
|
- preprocess |
|
LayoutLMv2Tokenizer |
|
[[autodoc]] LayoutLMv2Tokenizer |
|
- __call__
|
- save_vocabulary |
|
LayoutLMv2TokenizerFast |
|
[[autodoc]] LayoutLMv2TokenizerFast |
|
- __call__
|
LayoutLMv2Processor |
|
[[autodoc]] LayoutLMv2Processor |
|
- __call__
|
LayoutLMv2Model |
|
[[autodoc]] LayoutLMv2Model |
|
- forward |
|
LayoutLMv2ForSequenceClassification |
|
[[autodoc]] LayoutLMv2ForSequenceClassification |
|
LayoutLMv2ForTokenClassification |
|
[[autodoc]] LayoutLMv2ForTokenClassification |
|
LayoutLMv2ForQuestionAnswering |
|
[[autodoc]] LayoutLMv2ForQuestionAnswering |