Basic Questions
Is this model supposed to be an alternative to the VLMs used in Docling (Granite Vision, SmolVLM, etc.)? Is it intended to be used as a module within Docling?
Or is this model supposed to be an alternative to Docling entirely, where a document is provided to SmolDocling and some text is generated?
Hello @tintegral,
SmolDocling is part of the Docling family. It can be used independently (see the examples in the model card) or together with Docling, into which it is already integrated:
- To use in Python: see the SmolDocling VLM pipeline example: https://github.com/docling-project/docling/blob/main/docs/examples/minimal_vlm_pipeline.py
- Within the Docling CLI:
  docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062
Of course, we also include DocTags processing routines in Docling, so you can, for example, convert the DocTags that SmolDocling outputs into MD or HTML with Docling.
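For reference, a minimal sketch of that Python usage, closely following the linked minimal_vlm_pipeline.py example (exact option names may differ between Docling versions):

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDFs through the VLM pipeline; SmolDocling is the default VLM model
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(),
        )
    }
)

result = converter.convert("https://arxiv.org/pdf/2206.01062")
print(result.document.export_to_markdown())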
Thank you for the reply!
Still thinking through this ... prior to SmolDocling, my understanding was that we might convert a document with Docling using something like EasyOCR to do OCR, TableFormer to process tables, and HF SmolVLM to caption pictures. Can SmolDocling take this same role, isolated to captioning pictures as an aid to Docling?
In the CLI command you shared (docling --pipeline vlm --vlm-model smoldocling https://arxiv.org/pdf/2206.01062), are all processing tasks handled instead by SmolDocling (text recognition, table recognition, picture captioning, etc.)?
@tintegral yes, SmolDocling is an end-to-end model which replaces most of the individual models in the standard Docling pipeline (OCR, layout, TableFormer, equations, code, ...). Image captioning, however, is an enrichment that currently happens afterwards, with different models of your choice. We have not tested SmolDocling for image captioning, since it has been trained for conversion tasks, and SmolVLM, the base model, can already do image captioning.
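If you want to try the enrichment route, here is a rough sketch of how picture description can be attached to the standard PDF pipeline; the option names (do_picture_description, smolvlm_picture_description) follow recent Docling releases and may differ in older versions:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    smolvlm_picture_description,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_picture_description = True  # caption pictures as an enrichment step
pipeline_options.picture_description_options = smolvlm_picture_description
pipeline_options.generate_picture_images = True  # keep cropped picture images for the captioner
pipeline_options.images_scale = 2.0

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("https://arxiv.org/pdf/2206.01062")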
I can confirm that we are able to get image captioning working using SmolDocling. We'll need to spend some time refining but, in general, it correctly identifies the images and produces reasonable descriptions / text extraction. Early results are better than SmolVLM.
Captioning is important to us because it would be ideal to pass documents through a single model which produces text representations of everything on a page.
Also, can the team please explain the level of sophistication with regard to charts? I see in the Gradio demo here the case where a bar chart is converted to OTSL. When a document is passed into SmolDocling, can the charts be interpreted and described in text (similar to my question about image captions), with those descriptions placed at the same location in the document as the original chart? Is that automated somehow yet? Would this be handled through a prompt like "convert all charts to tables" (as mentioned in the model card)?
Thank you for this great model. I have it outputting DocTags into MD and .txt files, but I now want to convert to MD format without the DocTags being present in the final MD. In other words, I want to leverage the DocTags to get the best possible MD version of a PDF, but I don't want the DocTags in the MD file. I can't seem to find documentation on how to do this. Any ideas, or have I misunderstood the use case, and the idea is that the DocTags remain in the final MD file?
@AcidLuigi, I'm responding since I received a notification, but I don't think you were asking me. Obvious question, but: are you specifying Markdown as your output format? I am able to get clean Markdown output.
https://docling-project.github.io/docling/reference/cli/
--to choice (md | json | html | text | doctags) Specify output formats. Defaults to Markdown.
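For example, to write clean Markdown via the CLI (the input path here is just a placeholder):

docling --pipeline vlm --vlm-model smoldocling --to md your_document.pdf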
Thanks for your help. I am running the model via a Python script rather than the direct CLI approach, but I am a bit of a noob, so apologies if I appear to be making little sense.
Here is the part of my script that converts to MD. Do you see any immediate issues? Currently, the generated MD files contain parts of the DocTags. For example: 54>245>292>253>Sehr geehrte
Script:
if self.save_md:
    self.logger.info(f"Converting to Markdown format: {output_md_path}")
    try:
        # Create DocTagsDocument from the doctags and images
        doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(all_page_doctags, all_page_images)
        # Create and load DoclingDocument
        doc = DoclingDocument(name=input_pdf_path.stem)
        doc.load_from_doctags(doctags_doc)
        # Save as Markdown
        doc.save_as_markdown(output_md_path)
        self.logger.info("Markdown file saved successfully")
        self.progress_update.emit(f" -> Markdown file saved successfully.")
    except Exception as md_err:
        self.logger.error(f"Error saving .md file: {md_err}", exc_info=True)
        self.progress_update.emit(f" -> ERROR saving .md file: {md_err}")
        self.error_occurred.emit(f"Failed to write MD for {input_pdf_path.name}")
        export_success = False
Looking at your implementation compared to the Hugging Face Space code, I notice a few key differences in how the doctags are processed before conversion to Markdown.
The Space code includes these important pre-processing steps before creating the DocTagsDocument:
# Clean the output
cleaned_output = full_output.replace("<end_of_utterance>", "").strip()
doctag_output = cleaned_output

# Handle special cases for charts and fix location tags
if any(tag in doctag_output for tag in ["<doctag>", "<otsl>", "<code>", "<chart>", "<formula>"]):
    doc = DoclingDocument(name="Document")
    if "<chart>" in doctag_output:
        doctag_output = doctag_output.replace("<chart>", "<otsl>").replace("</chart>", "</otsl>")
        doctag_output = re.sub(r'(<loc_500>)(?!.*<loc_500>)<[^>]+>', r'\1', doctag_output)
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctag_output], images)
    doc.load_from_doctags(doctags_doc)
In your implementation, you're passing the raw doctags directly to from_doctags_and_image_pairs() without this pre-processing. Try adding similar cleaning steps to your code:
# (make sure `re` is imported at the top of your module)
if self.save_md:
    self.logger.info(f"Converting to Markdown format: {output_md_path}")
    try:
        # Pre-process doctags before conversion
        processed_doctags = []
        for doctag in all_page_doctags:
            # Clean the output
            cleaned_doctag = doctag.replace("<end_of_utterance>", "").strip()
            # Handle special cases for charts
            if "<chart>" in cleaned_doctag:
                cleaned_doctag = cleaned_doctag.replace("<chart>", "<otsl>").replace("</chart>", "</otsl>")
                cleaned_doctag = re.sub(r'(<loc_500>)(?!.*<loc_500>)<[^>]+>', r'\1', cleaned_doctag)
            processed_doctags.append(cleaned_doctag)
        # Create DocTagsDocument from the processed doctags and images
        doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(processed_doctags, all_page_images)
        # Create and load DoclingDocument
        doc = DoclingDocument(name=input_pdf_path.stem)
        doc.load_from_doctags(doctags_doc)
        # Save as Markdown
        doc.save_as_markdown(output_md_path)
        self.logger.info("Markdown file saved successfully")
        self.progress_update.emit(f" -> Markdown file saved successfully.")
    except Exception as md_err:
        self.logger.error(f"Error saving .md file: {md_err}", exc_info=True)
        self.progress_update.emit(f" -> ERROR saving .md file: {md_err}")
        self.error_occurred.emit(f"Failed to write MD for {input_pdf_path.name}")
        export_success = False
Tags like "54>245>292>253>Sehr geehrte" appearing in your output suggest that the location tags aren't being properly processed. Location tags in the DocTags format need special handling to be correctly interpreted by the DoclingDocument processor.
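For illustration only: a well-formed DocTags location prefix looks like <loc_54><loc_245><loc_292><loc_253>, and residue like "54>245>292>253>" is what remains when the leading "<loc_" parts get lost. As a last resort, you could strip such fragments before export; the pattern below is my own assumption, not a docling-core API:

import re

# Hypothetical cleanup for bare location residue such as "54>245>292>253>".
# This regex is an illustration only, not part of docling-core.
LOC_RESIDUE = re.compile(r"(?:\d{1,3}>){2,}")
print(LOC_RESIDUE.sub("", "54>245>292>253>Sehr geehrte"))  # -> "Sehr geehrte"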
If these changes don't resolve the issue, I recommend studying the complete implementation in the Hugging Face Space code, particularly the DoclingDocument and DocTagsDocument classes, to understand all the nuances of the conversion process.
https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo
https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo/blob/main/app.py
Handwritten text was not extracted. Could you please suggest how to perform OCR on handwritten text?
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import ImageRefMode
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from PIL import Image
import base64
from io import BytesIO

# Image(s) to process
sources = [
    "/content/invoice-sample.jpg",  # Replace with your actual image path
]

pipeline_options1 = PdfPipelineOptions()

# Configure the pipeline
pipeline_options = VlmPipelineOptions()
pipeline_options.force_backend_text = False
pipeline_options.vlm_options = smoldocling_vlm_conversion_options

# Note: these table-structure options are set on pipeline_options1,
# which is never passed to the converter below
pipeline_options1.do_table_structure = True
pipeline_options1.table_structure_options.do_cell_matching = True

# ✅ Use GPU (optional)
pipeline_options.accelerator_options.device = "cuda"
pipeline_options.accelerator_options.cuda_use_flash_attention2 = False

# Build converter for images
doc_converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Run OCR and export to Markdown
for file_path in sources:
    image_path = Path(file_path)
    result = doc_converter.convert(image_path)

    # Export Markdown with embedded image refs
    markdown = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)

    # If no image was embedded by Docling, embed the full input image manually
    if "data:image" not in markdown:
        with Image.open(image_path) as img:
            buffered = BytesIO()
            img.save(buffered, format="PNG")
            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
            markdown += f"\n\n![Image](data:image/png;base64,{img_base64})"

    print("\n=== Markdown with Image, Text, and Table Structure ===\n")
    print(markdown)
For this code, this is the output (preceded by a tokenizer warning):

"Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`."
=== Markdown with Image, Text, and Table Structure ===
![Image]
Your Company LLC Address 123, State, My Country P 111-222-333, F 111-222-334
BILL TO:
John Doe
Alpha Bravo Road 33
P: 111-222-333, F: 111-222-334
SHIPPING TO:
John Doe Office
Office Road 38
P: 111-333-222, F: 122-222-334
NO PRODUCTS / SERVICE
QUANTITY / HOURS
RATE / UNIT
AMOUNT
1 Tyre
2
$20
$40
2 Steering Wheel
5
$10
$50
3 Engine Oil
10
$15
$150
4 Brake Pad
24
$1000
$2,400
Subtotal
$275
Tax (10%)
$275
Grand Total
$302.5
What I want here: when using the VLM pipeline options, is there a way to apply do_table_structure so that the table structure is maintained, along with the extraction, for this image?