|
--- |
|
library_name: transformers |
|
tags: |
|
- vision-language |
|
- llava |
|
- mistral |
|
- qlora |
|
- 4bit |
|
- document-vqa |
|
- fine-tuned |
|
license: apache-2.0 |
|
--- |
|
|
|
# Model Card for `avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3` |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
This model is a **vision–language model** fine‑tuned for **document visual question answering (DocVQA)**. |
|
It is based on **LLaVA v1.6 with a Mistral‑7B backbone**, loaded in **4‑bit quantized form** and fine‑tuned with **QLoRA** for memory‑efficient training (a sketch of a typical QLoRA setup appears after the list below). |
|
|
|
- **Developed by:** Avishek Jana |
|
- **Model type:** Multimodal (image + text) generative model (vision encoder paired with a decoder‑only Mistral‑7B language model) |
|
- **Language(s):** English |
|
- **License:** Apache 2.0 (inherited from the base model and dataset licenses) |
|
- **Finetuned from:** [`llava-hf/llava-v1.6-mistral-7b-hf`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) |
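
The exact training configuration for this checkpoint is not documented in this card. As a point of reference, the snippet below is a minimal sketch of how a QLoRA setup for this kind of fine‑tune typically looks with `peft`; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values actually used for this model.

```python
# Hypothetical QLoRA setup sketch -- the hyperparameters below are assumptions,
# not the documented training configuration of this checkpoint.
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the base model in 4-bit
base = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Attach LoRA adapters to the attention projections of the language model
lora_config = LoraConfig(
    r=16,                                                     # assumed rank
    lora_alpha=32,                                            # assumed scaling
    lora_dropout=0.05,                                        # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Training then proceeds with a standard `Trainer` or custom loop over image–question–answer examples; only the LoRA weights are updated while the 4‑bit base model stays frozen.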
|
|
|
### Model Sources |
|
- **Repository:** [Hugging Face Hub](https://huggingface.co/avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3) |
|
- **Base paper:** [LLaVA: Large Language and Vision Assistant](https://arxiv.org/abs/2304.08485) |
|
- **Base model:** [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) |
|
|
|
--- |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- Designed for **document-based visual question answering**:

  - Provide a scanned or digital document image (ID card, form, receipt, etc.).

  - Ask a question about the content of that document.

  - The model generates an answer based on both visual layout and text.
|
|
|
**Example:** |
|
*Question:* “What is the name of the person on the ID card?” |
|
*Image:* ID screenshot |
|
*Output:* “Avishek Jana” |
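
Concretely, a single DocVQA query looks like the sketch below. It assumes `model` and `processor` have already been loaded as shown in the “How to Get Started with the Model” section, and `id_card.png` is a placeholder path for your own document image.

```python
from PIL import Image

# Placeholder path: substitute your own scanned document or ID image
image = Image.open("id_card.png")
question = "What is the name of the person on the ID card?"

# LLaVA v1.6 / Mistral instruction format; <image> marks where the image is inserted
prompt = f"[INST] <image>\n{question} [/INST]"

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```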
|
|
|
### Downstream Use |
|
- Can be integrated into:

  - Document analysis pipelines

  - Intelligent chatbots handling PDFs/images

  - Enterprise OCR/IDP systems
|
|
|
### Out‑of‑Scope Use |
|
- Natural scene VQA (street signs, animals, etc.): the model is not optimized for general imagery. |
|
- Medical or highly sensitive decision‑making without human oversight. |
|
- Non‑English documents (limited support). |
|
|
|
--- |
|
|
|
## Bias, Risks, and Limitations |
|
- **Language:** Primarily trained for English; may not work well with other languages. |
|
- **Document type:** Best on structured documents; may struggle with handwritten or very low‑quality scans. |
|
- **Bias:** Inherits biases from the base model (Mistral‑7B) and the training data. |
|
- **Risk:** Should not be relied on for critical compliance decisions without human verification. |
|
|
|
### Recommendations |
|
Always verify answers when using in high‑stakes scenarios. Mask or redact sensitive data before sharing with third parties. |
|
|
|
--- |
|
|
|
## How to Get Started with the Model |
|
|
|
```python
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextForConditionalGeneration
import torch
from PIL import Image
from io import BytesIO
import requests

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"
REPO_ID = "avishekjana/llava-v1.6-mistral-7b-FineTuned-custom-docvqa-4bit-0.3"

# The processor (tokenizer + image processor) comes from the base model
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Define the 4-bit (NF4) quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)

# Load the fine-tuned checkpoint (base model with the adapters on top) in 4-bit
model = LlavaNextForConditionalGeneration.from_pretrained(
    REPO_ID,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",  # place the quantized weights on the available GPU
)

# Example image (replace with your own document image for DocVQA)
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))

# LLaVA v1.6 / Mistral instruction format; <image> marks where the image is inserted
prompt = "[INST] <image>\nWhat is the lady holding? [/INST]"
max_output_token = 256

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=max_output_token)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
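
# Note: output[0] still contains the prompt tokens, so the decoded string repeats
# the question. To keep only the generated answer, decode just the new tokens:
# answer = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)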