
Basic Questions

#28 by tintegral - opened
  1. Is this model supposed to be an alternative to the VLMs used in Docling (Granite Vision, SmolVLM, etc.)? Is it intended to be used as a module within Docling?

  2. Or, is this model supposed to be an alternative to Docling entirely, where a document is provided to SmolDocling and some text is generated?

Docling org
edited Mar 24

Hello @tintegral ,
SmolDocling is part of the Docling family. It can be used independently (see the examples in the model card) and together with Docling; it is already integrated into Docling:

docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062

Of course, we also include DocTags processing routines in Docling, so you can, for example, convert the DocTags that SmolDocling outputs into MD or HTML with Docling.
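
In code, that conversion looks roughly like this (a minimal sketch following the model card's example; here doctags is the string SmolDocling generated and image is the page image it came from):

    from docling_core.types.doc import DoclingDocument
    from docling_core.types.doc.document import DocTagsDocument

    # Pair the generated DocTags string with its source page image
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)

    print(doc.export_to_markdown())  # or doc.save_as_html(...) for HTML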

Thank you for the reply!

Still thinking through this ... prior to SmolDocling, my understanding was that we might convert a document with Docling using something like EasyOCR for OCR, TableFormer to process tables, and HF SmolVLM to caption pictures. Can SmolDocling take this same role, isolated to captioning pictures as an aid to Docling?

In the CLI command you shared (docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062) are all processing tasks handled instead by SmolDocling (text recognition, table recognition, picture captioning, etc.)?

Docling org

@tintegral yes, SmolDocling is an end-to-end model which replaces most of the individual models in the standard Docling pipeline (OCR, layout, TableFormer, equations, code, ...). Image captioning, however, is an enrichment that currently happens afterwards, with different models of your choice. We have not tested SmolDocling for image captioning, since it has been trained for conversion tasks, and SmolVLM, the base model, can already do image captioning.
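
For anyone who wants captions in the same Docling run, the enrichment can be switched on in the standard pipeline. A minimal sketch, assuming the smolvlm_picture_description preset exposed in Docling's pipeline options (check the Docling enrichment docs for the current API):

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        smolvlm_picture_description,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True   # keep cropped picture images
    pipeline_options.images_scale = 2                 # higher-resolution crops
    pipeline_options.do_picture_description = True    # run the captioning enrichment
    pipeline_options.picture_description_options = smolvlm_picture_description

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    result = converter.convert("https://arxiv.org/pdf/2206.01062")
    print(result.document.export_to_markdown())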

I can confirm that we are able to get image captioning working using SmolDocling. We'll need to spend some time refining, but in general it correctly identifies the images and produces reasonable descriptions / text extraction. Early results are better than with SmolVLM.

Captioning is important to us because it would be ideal to pass documents through a single model which produces text representations of everything on a page.

Also, can the team please explain the level of sophistication with regard to charts? I see in the Gradio demo here the case where a bar chart is converted to OTSL. Is it currently known whether, when a document is passed into SmolDocling, charts can be interpreted and described in text (similar to my question about image captions), with those descriptions placed at the same location in the document as the original chart? Is that automated somehow yet? Would this be handled through a prompt like "convert all charts to tables" (as mentioned in the model card)?

Thank you for this great model. I have it outputting doctags into MD and .txt files, but now I want to convert to MD format (without the doctags being present in the final MD). In other words, I want to leverage the doctags to get the best possible MD version of a PDF, but I don't want the doctags in the MD file itself. I can't seem to find documentation on how to do this. Any ideas, or have I misunderstood the use case here, and the idea is that the doctags remain in the final MD file?

@AcidLuigi , I'm responding since I received a notification, though I don't think you were asking me. Obvious question, but: are you specifying Markdown as your output format? I am able to get clean Markdown output.

https://docling-project.github.io/docling/reference/cli/

--to choice (md | json | html | text | doctags) Specify output formats. Defaults to Markdown.
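
Combined with the command shared earlier in this thread, that would be something like:

    docling --pipeline vlm --vlm-model smoldocling --to md https://arxiv.org/pdf/2206.01062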

Thanks for your help. I am running the model via a Python script rather than the CLI directly, but I am a bit of a noob, so apologies if I appear to be making little sense.

Here is the part of my script that converts to MD. Do you see any immediate issues? Currently the generated MD files contain parts of the doctags. For example: 54>245>292>253>Sehr geehrte

Script:

                    if self.save_md:
                        self.logger.info(f"Converting to Markdown format: {output_md_path}")
                        try:
                            # Create DocTagsDocument from the doctags and images
                            doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(all_page_doctags, all_page_images)
                            
                            # Create and load DoclingDocument
                            doc = DoclingDocument(name=input_pdf_path.stem)
                            doc.load_from_doctags(doctags_doc)
                            
                            # Save as Markdown
                            doc.save_as_markdown(output_md_path)
                            self.logger.info("Markdown file saved successfully")
                            self.progress_update.emit(f"    -> Markdown file saved successfully.")
                        except Exception as md_err:
                            self.logger.error(f"Error saving .md file: {md_err}", exc_info=True)
                            self.progress_update.emit(f"    -> ERROR saving .md file: {md_err}")
                            self.error_occurred.emit(f"Failed to write MD for {input_pdf_path.name}")
                            export_success = False

@AcidLuigi

Looking at your implementation compared to the Hugging Face Space code, I notice a few key differences in how the doctags are processed before conversion to Markdown.

The Space code includes these important pre-processing steps before creating the DocTagsDocument:

# Clean the output
cleaned_output = full_output.replace("<end_of_utterance>", "").strip()

doctag_output = cleaned_output

# Handle special cases for charts and fix location tags
if any(tag in doctag_output for tag in ["<doctag>", "<otsl>", "<code>", "<chart>", "<formula>"]):
    doc = DoclingDocument(name="Document")
    if "<chart>" in doctag_output:
        doctag_output = doctag_output.replace("<chart>", "<otsl>").replace("</chart>", "</otsl>")
        # After the last <loc_500>, drop the single stray tag that immediately follows it
        doctag_output = re.sub(r'(<loc_500>)(?!.*<loc_500>)<[^>]+>', r'\1', doctag_output)

    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctag_output], images)
    doc.load_from_doctags(doctags_doc)

In your implementation, you're passing the raw doctags directly to from_doctags_and_image_pairs() without this pre-processing. Try adding similar cleaning steps to your code:

# (make sure `import re` is at the top of your module for the cleanup below)
if self.save_md:
    self.logger.info(f"Converting to Markdown format: {output_md_path}")
    try:
        # Pre-process doctags before conversion
        processed_doctags = []
        for doctag in all_page_doctags:
            # Clean the output
            cleaned_doctag = doctag.replace("<end_of_utterance>", "").strip()
            
            # Handle special cases for charts
            if "<chart>" in cleaned_doctag:
                cleaned_doctag = cleaned_doctag.replace("<chart>", "<otsl>").replace("</chart>", "</otsl>")
                cleaned_doctag = re.sub(r'(<loc_500>)(?!.*<loc_500>)<[^>]+>', r'\1', cleaned_doctag)
            
            processed_doctags.append(cleaned_doctag)
        
        # Create DocTagsDocument from the processed doctags and images
        doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(processed_doctags, all_page_images)
        
        # Create and load DoclingDocument
        doc = DoclingDocument(name=input_pdf_path.stem)
        doc.load_from_doctags(doctags_doc)
        
        # Save as Markdown
        doc.save_as_markdown(output_md_path)
        self.logger.info("Markdown file saved successfully")
        self.progress_update.emit(f"    -> Markdown file saved successfully.")
    except Exception as md_err:
        self.logger.error(f"Error saving .md file: {md_err}", exc_info=True)
        self.progress_update.emit(f"    -> ERROR saving .md file: {md_err}")
        self.error_occurred.emit(f"Failed to write MD for {input_pdf_path.name}")
        export_success = False

The issue with tags like "54>245>292>253>Sehr geehrte" appearing in your output suggests that the location tags aren't being properly processed. The location tags in doctags format need special handling to be correctly interpreted by the DoclingDocument processor.
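
For context, my understanding (from the model card's description of DocTags, so treat the exact shape as an assumption) is that a well-formed element carries four location tags on a 0-500 grid in front of the text, e.g.:

    <text><loc_54><loc_245><loc_292><loc_253>Sehr geehrte ...</text>

If the loc_ prefixes get stripped somewhere along the way (for instance by an HTML-style tag cleanup), you are left with exactly the 54>245>292>253> residue you are seeing.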

If these changes don't resolve the issue, I recommend studying the complete implementation in the Hugging Face Space code, particularly the DoclingDocument and DocTagsDocument classes, to understand all the nuances of the conversion process.

https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo
https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo/blob/main/app.py

Handwritten text was not extracted.

Could you please suggest how to perform OCR on handwritten text?

from pathlib import Path
import base64
from io import BytesIO

from PIL import Image

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,  # added: used below but missing from the original imports
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling_core.types.doc import ImageRefMode

# Image(s) to process
sources = [
    "/content/invoice-sample.jpg",  # Replace with your actual image path
]

pipeline_options1 = PdfPipelineOptions()

# Configure the pipeline
pipeline_options = VlmPipelineOptions()
pipeline_options.force_backend_text = False
pipeline_options.vlm_options = smoldocling_vlm_conversion_options
# Note: these table-structure options live on the PDF pipeline options,
# which are never passed to the converter below
pipeline_options1.do_table_structure = True
pipeline_options1.table_structure_options.do_cell_matching = True

# ✅ Use GPU (optional)
pipeline_options.accelerator_options.device = "cuda"
pipeline_options.accelerator_options.cuda_use_flash_attention2 = False

# Build converter for images
doc_converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Run OCR and export to Markdown
for file_path in sources:
    image_path = Path(file_path)
    result = doc_converter.convert(image_path)

    # Export Markdown with embedded image refs
    markdown = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)

    # If no image was embedded by Docling, embed the full input image manually
    if "data:image" not in markdown:
        with Image.open(image_path) as img:
            buffered = BytesIO()
            img.save(buffered, format="PNG")
            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
            markdown += f"\n\n![Full Image](data:image/png;base64,{img_base64})"

    print("\n=== Markdown with Image, Text, and Table Structure ===\n")
    print(markdown)

For this code, this is the output. First a tokenizer warning:

Truncation was not explicitly activated but max_length is provided a specific value, please use truncation=True to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to truncation.

=== Markdown with Image, Text, and Table Structure ===

![Image]

Your Company LLC Address 123, State, My Country P 111-222-333, F 111-222-334

BILL TO:

John Doe

Alpha Bravo Road 33

P: 111-222-333, F: 111-222-334

client@example.net

SHIPPING TO:

John Doe Office

Office Road 38

P: 111-333-222, F: 122-222-334

office@example.net

NO PRODUCTS / SERVICE

QUANTITY / HOURS

RATE / UNIT

AMOUNT

1 Tyre

2

$20

$40

2 Steering Wheel

5

$10

$50

3 Engine Oil

10

$15

$150

4 Brake Pad

24

$1000

$2,400

Subtotal

$275

Tax (10%)

$275

Grand Total

$302.5
What I want here: when using the VLM pipeline options, can we also enable do_table_structure so that the table structure is maintained along with the extraction for this image?
invoice-sample.jpg
