---
library_name: transformers
tags:
- vision-language
- llava
- mistral
- qlora
- 4bit
- document-vqa
- fine-tuned
license: apache-2.0
---

# Model Card for `avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3`

## Model Details

### Model Description
This model is a **vision–language model** fine‑tuned for **document visual question answering (DocVQA)**.  
It is based on **LLaVA v1.6 with a Mistral‑7B backbone** and was fine‑tuned with **QLoRA** on a **4‑bit quantized** base model for memory‑efficient training.

- **Developed by:** Avishek Jana  
- **Model type:** Multimodal (image + text) generative model (vision encoder paired with a decoder‑only Mistral‑7B language model)  
- **Language(s):** English  
- **License:** Apache 2.0 (inherited from the base model and dataset licenses)  
- **Finetuned from:** [`llava-hf/llava-v1.6-mistral-7b-hf`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)

### Model Sources
- **Repository:** [Hugging Face Hub](https://huggingface.co/avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3)
- **Base paper:** [LLaVA: Large Language and Vision Assistant](https://arxiv.org/abs/2304.08485)
- **Base model:** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)

---

## Uses

### Direct Use
- Designed for **document-based visual question answering**:
  - Provide a scanned or digital document image (ID card, form, receipt, etc.)
  - Ask a question about the content of that document.
  - The model generates an answer based on both visual layout and text.

**Example:**  
*Question:* “What is the name of the person on the ID card?”  
*Image:* a screenshot of an ID card  
*Output:* “Avishek Jana”
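
A minimal sketch of this flow in code, assuming `processor` and `model` are already loaded as shown in the "How to Get Started with the Model" section below; `id_card.png` is a placeholder path:

```python
from PIL import Image

# Assumes `processor` and `model` are loaded as in "How to Get Started with the Model" below.
# "id_card.png" is a placeholder for your own document image.
image = Image.open("id_card.png")
question = "What is the name of the person on the ID card?"

# LLaVA-Mistral instruction format: <image> marks where the image features are injected.
prompt = f"[INST] <image>\n{question} [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```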

### Downstream Use
- Can be integrated into:
  - Document analysis pipelines
  - Intelligent chatbots handling PDFs/images
  - Enterprise OCR/IDP systems

### Out‑of‑Scope Use
- Natural‑scene VQA (street signs, animals, etc.); the model is not optimized for this.
- Medical or highly sensitive decision‑making without human oversight.
- Non‑English documents (limited support).

---

## Bias, Risks, and Limitations
- **Language:** Primarily trained for English; may not work well with other languages.
- **Document type:** Best on structured documents; may struggle with handwritten or very low‑quality scans.
- **Bias:** Inherits biases from the base model (Mistral‑7B) and the training data.
- **Risk:** Should not be relied on for critical compliance decisions without human verification.

### Recommendations
Always verify the model's answers in high‑stakes scenarios. Mask or redact sensitive data before sharing documents with third parties.

---

## How to Get Started with the Model

```python
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests
from io import BytesIO

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
REPO_ID = "avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3"

processor = AutoProcessor.from_pretrained(MODEL_ID)

# 4-bit NF4 quantization config for memory-efficient loading
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)
# Load the base model with adapters on top
model = LlavaNextForConditionalGeneration.from_pretrained(
    REPO_ID,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)

# Download an example image (swap in your own document image for DocVQA)
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = Image.open(BytesIO(requests.get(image_url).content))

# LLaVA-Mistral instruction format: <image> marks where the image features are injected.
prompt = "[INST] <image>\nWhat is the lady holding? [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
answer = processor.decode(output[0], skip_special_tokens=True)
print(answer)