|
--- |
|
library_name: transformers |
|
tags: |
|
- vision-language |
|
- llava |
|
- mistral |
|
- qlora |
|
- 4bit |
|
- document-vqa |
|
- fine-tuned |
|
license: apache-2.0 |
|
--- |
|
|
|
# Model Card for `avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3` |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
This model is a **vision–language model** fine‑tuned for **document visual question answering (DocVQA)**. |
|
It is based on **LLaVA v1.6 with a Mistral‑7B backbone**, loaded in **4‑bit quantized form** and fine‑tuned with **QLoRA** for memory‑efficient training (a sketch of a typical QLoRA setup appears after the list below). |
|
|
|
- **Developed by:** Avishek Jana |
|
- **Model type:** Multimodal (image + text) generative model (vision encoder paired with a decoder‑only Mistral‑7B language model) |
|
- **Language(s):** English |
|
- **License:** Apache 2.0 (inherited from the base model and dataset licenses) |
|
- **Finetuned from:** [`llava-hf/llava-v1.6-mistral-7b-hf`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) |
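
The exact training configuration for this checkpoint is not documented in this card. As a point of reference, the snippet below is a minimal sketch of how a QLoRA setup for this kind of fine‑tune typically looks with `peft`; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values actually used for this model.

```python
# Hypothetical QLoRA setup sketch -- the hyperparameters below are assumptions,
# not the documented training configuration of this checkpoint.
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model in 4-bit
base = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Attach LoRA adapters to the attention projections of the language model
lora_config = LoraConfig(
    r=16,                                                     # assumed rank
    lora_alpha=32,                                            # assumed scaling
    lora_dropout=0.05,                                        # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Training then proceeds with a standard `Trainer` or custom loop over image–question–answer examples; only the LoRA weights are updated while the 4‑bit base model stays frozen.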
|
|
|
### Model Sources |
|
- **Repository:** [Hugging Face Hub](https://huggingface.co/avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3) |
|
- **Base paper:** [LLaVA: Large Language and Vision Assistant](https://arxiv.org/abs/2304.08485) |
|
- **Base model:** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) |
|
|
|
--- |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- Designed for **document-based visual question answering**:

  - Provide a scanned or digital document image (ID card, form, receipt, etc.).

  - Ask a question about the content of that document.

  - The model generates an answer based on both visual layout and text.
|
|
|
**Example:** |
|
*Question:* “What is the name of the person on the ID card?” |
|
*Image:* ID screenshot |
|
*Output:* “Avishek Jana” |
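
Concretely, a single DocVQA query looks like the sketch below. It assumes `model` and `processor` have already been loaded as shown in the “How to Get Started with the Model” section, and `id_card.png` is a placeholder path for your own document image.

```python
from PIL import Image

# Placeholder path: substitute your own scanned document or ID image
image = Image.open("id_card.png")
question = "What is the name of the person on the ID card?"

# LLaVA v1.6 / Mistral instruction format; <image> marks where the image is inserted
prompt = f"[INST] <image>\n{question} [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```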
|
|
|
### Downstream Use |
|
- Can be integrated into:

  - Document analysis pipelines

  - Intelligent chatbots handling PDFs/images

  - Enterprise OCR/IDP systems
|
|
|
### Out‑of‑Scope Use |
|
- Natural scene VQA (street signs, animals, etc.): the model is not optimized for general imagery. |
|
- Medical or highly sensitive decision‑making without human oversight. |
|
- Non‑English documents (limited support). |
|
|
|
--- |
|
|
|
## Bias, Risks, and Limitations |
|
- **Language:** Primarily trained for English; may not work well with other languages. |
|
- **Document type:** Best on structured documents; may struggle with handwritten or very low‑quality scans. |
|
- **Bias:** Inherits biases from the base model (Mistral‑7B) and the training data. |
|
- **Risk:** Should not be relied on for critical compliance decisions without human verification. |
|
|
|
### Recommendations |
|
Always verify answers when using in high‑stakes scenarios. Mask or redact sensitive data before sharing with third parties. |
|
|
|
--- |
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextForConditionalGeneration
import torch
from PIL import Image
from io import BytesIO
import requests

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
REPO_ID = "avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3"

# The processor (tokenizer + image processor) comes from the base model
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Define the 4-bit (NF4) quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)

# Load the fine-tuned checkpoint (base model with the adapters on top) in 4-bit
model = LlavaNextForConditionalGeneration.from_pretrained(
    REPO_ID,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",  # place the quantized weights on the available GPU
)

# Example image (replace with your own document image for DocVQA)
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# LLaVA v1.6 / Mistral instruction format; <image> marks where the image is inserted
prompt = "[INST] <image>\nWhat is the lady holding? [/INST]"
max_output_token = 256

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=max_output_token)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
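
# Note: output[0] still contains the prompt tokens, so the decoded string repeats
# the question. To keep only the generated answer, decode just the new tokens:
# answer = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)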