Model Card for avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3

Model Details

Model Description

This model is a vision–language model fine‑tuned for document visual question answering (DocVQA).
It is based on LLaVA v1.6 with a Mistral‑7B backbone, loaded in 4‑bit (NF4) quantization and fine‑tuned with QLoRA for memory‑efficient training.

  • Developed by: Avishek Jana
  • Model type: Multimodal (image + text) generative model (vision encoder paired with a decoder‑only Mistral‑7B language model)
  • Language(s): English
  • License: Apache 2.0 (inherits from base model and dataset license)
  • Finetuned from: llava-hf/llava-v1.6-mistral-7b-hf
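
The exact fine‑tuning configuration is not published in this card. A rough QLoRA setup consistent with the description above (4‑bit NF4 base model plus LoRA adapters) could look like the sketch below; every LoRA hyperparameter shown is an assumption, not the value actually used.

from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
base_model = prepare_model_for_kbit_training(base_model)

# Hypothetical LoRA hyperparameters -- the actual rank, alpha and target
# modules used for this checkpoint are not documented
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)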

Model Sources

  • Base model: llava-hf/llava-v1.6-mistral-7b-hf (Hugging Face Hub)
  • Fine‑tuned checkpoint: avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3 (Hugging Face Hub)

Uses

Direct Use

  • Designed for document-based visual question answering:
    • Provide a scanned or digital document image (ID card, form, receipt, etc.)
    • Ask a question about the content of that document.
    • The model generates an answer based on both visual layout and text.

Example:
Question: “What is the name of the person present on the ID card?”
Image: ID screenshot
Output: “Avishek Jana”
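
A minimal sketch of how such a question is phrased for the model (the image path is a placeholder; the [INST] ... [/INST] template is the LLaVA v1.6 Mistral chat format used in the code further below):

from PIL import Image

# Placeholder path -- substitute your own scanned document or screenshot
doc_image = Image.open("id_card.png")
question = "What is the name of the person present on the ID card?"

# The <image> token marks where the document image is injected into the prompt
prompt = f"[INST] <image>\n{question} [/INST]"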

Downstream Use

  • Can be integrated into:
    • Document analysis pipelines
    • Intelligent chatbots handling PDFs/images
    • Enterprise OCR/IDP systems
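
For pipeline‑style integration, a thin wrapper along the lines of the sketch below can be used. It assumes model and processor have already been loaded as shown in the "How to Get Started" section; the function name is purely illustrative.

def answer_document_question(image, question, max_new_tokens=256):
    """Answer a question about a single document image."""
    prompt = f"[INST] <image>\n{question} [/INST]"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the tokens generated after the prompt
    answer = processor.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return answer.strip()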

Out‑of‑Scope Use

  • Natural scene VQA (street signs, animals, etc.); the model is not optimized for general scenes.
  • Medical or highly sensitive decision‑making without human oversight.
  • Non‑English documents (limited support).

Bias, Risks, and Limitations

  • Language: Primarily trained for English; may not work well with other languages.
  • Document type: Best on structured documents; may struggle with handwritten or very low‑quality scans.
  • Bias: Inherits biases from the base model (Mistral‑7B) and the training data.
  • Risk: Should not be relied on for critical compliance decisions without human verification.

Recommendations

Always verify answers when using in high‑stakes scenarios. Mask or redact sensitive data before sharing with third parties.
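
As a simple illustration, sensitive regions of a document image can be blacked out before the image leaves your environment; the file name and coordinates below are placeholders.

from PIL import Image, ImageDraw

img = Image.open("id_card.png")  # placeholder file name
draw = ImageDraw.Draw(img)
# Purely illustrative bounding box of a sensitive field (left, top, right, bottom)
draw.rectangle((120, 80, 420, 110), fill="black")
img.save("id_card_redacted.png")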


How to Get Started with the Model

from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextForConditionalGeneration
import torch

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
REPO_ID = "avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3"

processor = AutoProcessor.from_pretrained(MODEL_ID)

# Define quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)
# Load the fine-tuned checkpoint (base model with the DocVQA adapters applied) in 4-bit
model = LlavaNextForConditionalGeneration.from_pretrained(
    REPO_ID,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)

from PIL import Image
import requests
from io import BytesIO

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

prompt = "[INST] <image>\nWhat is the lady holding? [/INST]"
max_output_tokens = 256
# Keyword arguments keep the call compatible across processor versions
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=max_output_tokens)
answer = processor.decode(output[0], skip_special_tokens=True)
print(answer)
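
Note that decoding the full output sequence returns the prompt together with the answer. If only the generated answer is needed, the prompt tokens can be sliced off first (a small follow‑up to the snippet above):

# Keep only the tokens generated after the prompt
generated_tokens = output[0][inputs["input_ids"].shape[1]:]
answer_only = processor.decode(generated_tokens, skip_special_tokens=True).strip()
print(answer_only)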