|
|
|
LLaVa |
|
Overview |
|
LLaVa is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. In other words, it is a multimodal version of LLMs fine-tuned for chat / instructions.
|
The LLaVa model was proposed in Visual Instruction Tuning and improved in Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. |
|
The abstract from the paper is the following: |
|
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
|
|
|
LLaVa architecture. Taken from the original paper. |
|
This model was contributed by ArthurZ and ybelkada. |
|
The original code can be found here. |
|
Usage tips |
|
|
|
We advise users to use padding_side="left" for batched generation, as it leads to more accurate results. Simply make sure to set processor.tokenizer.padding_side = "left" before generating.
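For example, a minimal sketch (the llava-hf/llava-1.5-7b-hf checkpoint name is illustrative; any LLaVa checkpoint on the Hub works the same way):

```python
from transformers import AutoProcessor

# Illustrative checkpoint name; substitute any LLaVa checkpoint from the Hub.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Set left padding once, before any batched generation call,
# so that prompts of different lengths are padded on the left.
processor.tokenizer.padding_side = "left"
```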
|
|
|
Note that the model has not been explicitly trained to process multiple images in the same prompt; although this is technically possible, you may experience inaccurate results.
|
|
|
For better results, we recommend prompting the model with the correct prompt format:
|
|
|
"USER: <image>\n<prompt>ASSISTANT:" |
|
For multi-turn conversations:
|
|
|
"USER: <image>\n<prompt1>ASSISTANT: <answer1>USER: <prompt2>ASSISTANT: <answer2>USER: <prompt3>ASSISTANT:" |
|
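Putting it together, here is a minimal single-turn inference sketch using this prompt format (the checkpoint name and image URL are illustrative, not taken from this page):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Single-turn prompt following the format described above.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # illustrative image
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```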
Using Flash Attention 2 |
|
Flash Attention 2 is a faster, optimized version of the model's attention computation. Please refer to the Flash Attention 2 section of the performance docs.
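A minimal sketch of loading the model with Flash Attention 2 enabled (requires the flash-attn package and a supported CUDA GPU; the checkpoint name is illustrative):

```python
import torch
from transformers import LlavaForConditionalGeneration

# attn_implementation="flash_attention_2" requires flash-attn to be installed
# and is used together with half precision on a supported CUDA GPU.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # illustrative checkpoint
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
```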
|
Resources |
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaVa.
|
|
|
A Google Colab demo on how to run Llava on a free-tier Google Colab instance leveraging 4-bit inference.
|
A similar notebook showcasing batched inference. 🌎 |
|
|
|
LlavaConfig |
|
[[autodoc]] LlavaConfig |
|
LlavaProcessor |
|
[[autodoc]] LlavaProcessor |
|
LlavaForConditionalGeneration |
|
[[autodoc]] LlavaForConditionalGeneration |
|
- forward |