Model Card for avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3

Model Details

Model Description

This model is a vision–language model fine‑tuned for document visual question answering (DocVQA).
It is based on LLaVA v1.6 with a Mistral‑7B backbone, loaded in 4‑bit (NF4) quantization and fine‑tuned with QLoRA for memory‑efficient training.

  • Developed by: Avishek Jana
  • Model type: Multimodal (image + text) generative model (vision encoder paired with a decoder‑only Mistral‑7B language model)
  • Language(s): English
  • License: Apache 2.0 (inherits from base model and dataset license)
  • Finetuned from: llava-hf/llava-v1.6-mistral-7b-hf
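
The exact fine‑tuning configuration is not published in this card. A rough QLoRA setup consistent with the description above (4‑bit NF4 base model plus LoRA adapters) could look like the sketch below; every LoRA hyperparameter shown is an assumption, not the value actually used.

from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
)
base_model = prepare_model_for_kbit_training(base_model)

# Hypothetical LoRA hyperparameters -- the actual rank, alpha and target
# modules used for this checkpoint are not documented
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)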

Model Sources

  • Base model: llava-hf/llava-v1.6-mistral-7b-hf (Hugging Face Hub)
  • Fine‑tuned checkpoint: avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3 (Hugging Face Hub)

Uses

Direct Use

  • Designed for document-based visual question answering:
    • Provide a scanned or digital document image (ID card, form, receipt, etc.)
    • Ask a question about the content of that document.
    • The model generates an answer based on both visual layout and text.

Example:
Question: “What is the name of the person present on the ID card?”
Image: ID screenshot
Output: “Avishek Jana”
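
A minimal sketch of how such a question is phrased for the model (the image path is a placeholder; the [INST] ... [/INST] template is the LLaVA v1.6 Mistral chat format used in the code further below):

from PIL import Image

# Placeholder path -- substitute your own scanned document or screenshot
doc_image = Image.open("id_card.png")
question = "What is the name of the person present on the ID card?"

# The <image> token marks where the document image is injected into the prompt
prompt = f"[INST] <image>\n{question} [/INST]"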

Downstream Use

  • Can be integrated into:
    • Document analysis pipelines
    • Intelligent chatbots handling PDFs/images
    • Enterprise OCR/IDP systems
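
For pipeline‑style integration, a thin wrapper along the lines of the sketch below can be used. It assumes model and processor have already been loaded as shown in the "How to Get Started" section; the function name is purely illustrative.

def answer_document_question(image, question, max_new_tokens=256):
    """Answer a question about a single document image."""
    prompt = f"[INST] <image>\n{question} [/INST]"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the tokens generated after the prompt
    answer = processor.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return answer.strip()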

Out‑of‑Scope Use

  • Natural scene VQA (street signs, animals, etc.); the model is not optimized for general scenes.
  • Medical or highly sensitive decision‑making without human oversight.
  • Non‑English documents (limited support).

Bias, Risks, and Limitations

  • Language: Primarily trained for English; may not work well with other languages.
  • Document type: Best on structured documents; may struggle with handwritten or very low‑quality scans.
  • Bias: Inherits biases from the base model (Mistral‑7B) and the training data.
  • Risk: Should not be relied on for critical compliance decisions without human verification.

Recommendations

Always verify answers when using in high‑stakes scenarios. Mask or redact sensitive data before sharing with third parties.
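
As a simple illustration, sensitive regions of a document image can be blacked out before the image leaves your environment; the file name and coordinates below are placeholders.

from PIL import Image, ImageDraw

img = Image.open("id_card.png")  # placeholder file name
draw = ImageDraw.Draw(img)
# Purely illustrative bounding box of a sensitive field (left, top, right, bottom)
draw.rectangle((120, 80, 420, 110), fill="black")
img.save("id_card_redacted.png")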


How to Get Started with the Model

from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextForConditionalGeneration
import torch

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
REPO_ID = "avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3"

processor = AutoProcessor.from_pretrained(MODEL_ID)

# Define quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)
# Load the fine-tuned checkpoint (base model with the DocVQA adapters applied) in 4-bit
model = LlavaNextForConditionalGeneration.from_pretrained(
    REPO_ID,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)

from PIL import Image
import requests
from io import BytesIO

image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

prompt = "[INST] <image>\nWhat is the lady holding? [/INST]"
max_output_tokens = 256
# Keyword arguments keep the call compatible across processor versions
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=max_output_tokens)
answer = processor.decode(output[0], skip_special_tokens=True)
print(answer)
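
Note that decoding the full output sequence returns the prompt together with the answer. If only the generated answer is needed, the prompt tokens can be sliced off first (a small follow‑up to the snippet above):

# Keep only the tokens generated after the prompt
generated_tokens = output[0][inputs["input_ids"].shape[1]:]
answer_only = processor.decode(generated_tokens, skip_special_tokens=True).strip()
print(answer_only)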