--- library_name: transformers tags: - vision-language - llava - mistral - qlora - 4bit - document-vqa - fine-tuned license: apache-2.0 --- # Model Card for `avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3` ## Model Details ### Model Description This model is a **vision–language model** fine‑tuned for **document visual question answering (DocVQA)**. It is based on **LLaVA v1.6 with a Mistral‑7B backbone**, loaded in **4‑bit quantized format** using QLoRA for efficient fine‑tuning. - **Developed by:** Avishek Jana - **Model type:** Multimodal (image + text) encoder–decoder style generation model - **Language(s):** English - **License:** Apache 2.0 (inherits from base model and dataset license) - **Finetuned from:** [`llava-hf/llava-v1.6-mistral-7b`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b) ### Model Sources - **Repository:** [Hugging Face Hub](https://huggingface.co/avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3) - **Base paper:** [LLaVA: Large Language and Vision Assistant](https://arxiv.org/abs/2304.08485) - **Base model:** [llava-hf/llava-v1.6-mistral-7b](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b) --- ## Uses ### Direct Use - Designed for **document-based visual question answering**: - Provide a scanned or digital document image (ID card, form, receipt, etc.) - Ask a question about the content of that document. - The model generates an answer based on both visual layout and text. **Example:** *Question:* “What is the name of the person present in ID card?” *Image:* ID screenshot *Output:* “Avishek Jana” ### Downstream Use - Can be integrated into: - Document analysis pipelines - Intelligent chatbots handling PDFs/images - Enterprise OCR/IDP systems ### Out‑of‑Scope Use - Natural scene VQA (street signs, animals, etc.) – not optimized for that. - Medical or highly sensitive decision‑making without human oversight. - Non‑English documents (limited support). --- ## Bias, Risks, and Limitations - **Language:** Primarily trained for English; may not work well with other languages. - **Document type:** Best on structured documents; may struggle with handwritten or very low‑quality scans. - **Bias:** Inherits biases from the base model (Mistral‑7B) and the training data. - **Risk:** Should not be relied on for critical compliance decisions without human verification. ### Recommendations Always verify answers when using in high‑stakes scenarios. Mask or redact sensitive data before sharing with third parties. --- ## How to Get Started with the Model ```python from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextForConditionalGeneration import torch from datasets import load_dataset from pprint import pprint MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf" REPO_ID = "avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3" processor = AutoProcessor.from_pretrained(MODEL_ID) # Define quantization config quantization_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16 ) # Load the base model with adapters on top model = LlavaNextForConditionalGeneration.from_pretrained( REPO_ID, torch_dtype=torch.float16, quantization_config=quantization_config, ) from PIL import Image import requests from io import BytesIO image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" response = requests.get(image_url) image = Image.open(BytesIO(response.content)) prompt = f"[INST] \nWhat the lady is holding? [/INST]" max_output_token = 256 inputs = processor(prompt, image, return_tensors="pt").to("cuda:0") output = model.generate(**inputs, max_new_tokens=max_output_token) response = processor.decode(output[0], skip_special_tokens=True) response