|
--- |
|
license: cc-by-nc-sa-4.0 |
|
datasets: |
|
- lmms-lab/LLaVA-Video-178K |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- lmms-lab/LLaVA-Video-7B-Qwen2 |
|
pipeline_tag: video-text-to-text |
|
library_name: transformers |
|
tags: |
|
- Action |
|
- Video |
|
- MQA |
|
- multimodal |
|
- VLM |
|
- LLaVAction |
|
- MLLMs |
|
model-index: |
|
- name: LLaVAction-7B |
|
results: |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: EgoSchema |
|
type: egoschema |
|
metrics: |
|
- type: accuracy |
|
value: 59 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MVBench |
|
type: mvbench |
|
metrics: |
|
- type: accuracy |
|
value: 61.1 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: NextQA |
|
type: nextqa |
|
metrics: |
|
- type: accuracy |
|
value: 82.8 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: PercepTest |
|
      type: perceptest
|
metrics: |
|
- type: accuracy |
|
value: 70.2 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: LongVideoBench |
|
type: longvideobench |
|
metrics: |
|
- type: accuracy |
|
value: 58.6 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: VideoMME |
|
type: videomme |
|
metrics: |
|
- type: accuracy |
|
value: 63.9 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: VideoMME (w-subs) |
|
type: videomme |
|
metrics: |
|
- type: accuracy |
|
value: 71.4 |
|
name: accuracy |
|
verified: true |
|
--- |
|
|
|
# LLaVAction-7B |
|
|
|
<div align="center"> |
|
<h2>LLaVAction: evaluating and training multi-modal large language models for action recognition |
|
</h2> |
|
|
|
[Shaokai Ye](https://yeshaokai.github.io/)<sup>1**</sup> |
|
[Haozhe Qi](https://people.epfl.ch/haozhe.qi)<sup>1**</sup> |
|
|
|
[Alexander Mathis](https://mathislab.org/)<sup>1</sup><sup>†</sup> |
|
[Mackenzie Weygandt Mathis](https://www.mackenziemathislab.org/mackenziemathis)<sup>1</sup><sup>†</sup><sup>‡</sup> |
|
|
|
<sup>1</sup> EPFL |
|
|
|
<sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author |
|
|
|
\[[arXiv Paper](https://arxiv.org/abs/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\]
|
|
|
</div> |
|
|
|
## Model Summary |
|
The LLaVAction-7B model is trained on EPIC-KITCHENS-100-MQA and is based on the Qwen2 language model with a context window of 32K tokens.
|
This model supports at most 64 frames. |
|
|
|
- **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/) |
|
- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/2503.18712)
|
- **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction) |
|
- **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis) |
|
- **Languages**: English |
|
|
## Usage
|
|
|
### Intended use |
|
The model was trained on EPIC-KITCHENS-100-MQA [dataset release pending] and [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). It has improved capabilities for understanding human egocentric actions in videos.
|
|
|
|
|
### Generation |
|
We provide a simple generation example below. For more details, please refer to our [GitHub repository](https://github.com/AdaptiveMotorControlLab/LLaVAction).
|
|
|
```python |
|
# Install the package first: pip install llavaction
|
|
|
from llavaction.model.builder import load_pretrained_model |
|
from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token |
|
from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX |
|
from llavaction.conversation import conv_templates, SeparatorStyle |
|
from PIL import Image |
|
import requests |
|
import copy |
|
import torch |
|
import sys |
|
import warnings |
|
from decord import VideoReader, cpu |
|
import numpy as np |
|
warnings.filterwarnings("ignore") |
|
|
|
# Your video (the model assumes an egocentric viewpoint)
|
video_path = "XXXX" |
|
|
|
# These are the prompts we trained with, but you can test others:
|
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?" |
|
task_prompt = "Describe in details what you see from the video frames." |
|
|
|
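# Uniformly sample up to max_frames_num frames; returns the frames, their timestamps (s), and the video duration (s)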
def load_video(video_path, max_frames_num,fps=1,force_sample=False): |
|
if max_frames_num == 0: |
|
return np.zeros((1, 336, 336, 3)) |
|
vr = VideoReader(video_path, ctx=cpu(0),num_threads=1) |
|
total_frame_num = len(vr) |
|
video_time = total_frame_num / vr.get_avg_fps() |
|
fps = round(vr.get_avg_fps()/fps) |
|
frame_idx = [i for i in range(0, len(vr), fps)] |
|
if len(frame_idx) > max_frames_num or force_sample: |
|
sample_fps = max_frames_num |
|
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int) |
|
frame_idx = uniform_sampled_frames.tolist() |
|
frame_time = [i/vr.get_avg_fps() for i in frame_idx] |
|
spare_frames = vr.get_batch(frame_idx).asnumpy() |
|
|
return spare_frames,frame_time,video_time |
|
|
|
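# Load the pretrained LLaVAction-7B checkpoint in bfloat16 with automatic device placement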
pretrained = "MLAdaptiveIntelligence/LLaVAction-7B" |
|
model_name = "llava_qwen" |
|
device = "cuda" |
|
device_map = "auto" |
|
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # pass any additional llava_model_args here
|
model.eval() |
|
max_frames_num = 64 |
|
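# Sample 64 frames uniformly from the video and preprocess them for the vision tower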
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True) |
|
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16) |
|
video = [video] |
|
conv_template = "qwen_1_5"  # make sure to use the correct chat template for the model
|
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. " |
|
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}" |
|
|
|
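# Build the chat prompt from the conversation template, with the image token and prompts as the user turn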
conv = copy.deepcopy(conv_templates[conv_template]) |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt_question = conv.get_prompt() |
|
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device) |
|
|
|
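# Greedy decoding over the sampled video frames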
cont = model.generate( |
|
input_ids, |
|
images=video, |
|
modalities= ["video"], |
|
do_sample=False, |
|
temperature=0, |
|
max_new_tokens=4096, |
|
) |
|
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip() |
|
print(text_outputs) |
|
``` |
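
Since the model is also trained on multiple-choice question answering (MQA) over EPIC-KITCHENS-100, the same pipeline can be reused for multiple-choice prompts. The sketch below reuses the `model`, `tokenizer`, `video`, and prompt variables from the example above; the option list and instruction wording are illustrative assumptions, not the exact EPIC-KITCHENS-100-MQA template.

```python
# Minimal MQA-style sketch reusing the objects loaded above.
# The options below are hypothetical examples, not the training template.
mc_question = (
    DEFAULT_IMAGE_TOKEN
    + f"\n{time_instruction}\n{perspective_prompt} "
    + "Select the single best answer.\n"
    + "A. cut the onion\nB. wash the pan\nC. open the fridge\nD. stir the pot\n"
    + "Answer with the letter only."
)

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], mc_question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

with torch.inference_mode():
    out = model.generate(
        input_ids,
        images=video,
        modalities=["video"],
        do_sample=False,
        max_new_tokens=16,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0].strip())
```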
|
|
|
|
|
## Training |
|
|
|
See details in Ye et al. 2025: https://arxiv.org/abs/2503.18712
|
|
|
### Model |
|
- **Architecture**: SO400M vision encoder + Qwen2 language model
|
- **Initialized Model**: lmms-lab/LLaVA-Video-7B-Qwen2 |
|
- **Data**: a mixture of LLaVA-Video-178K and EPIC-KITCHENS-100-MQA; 2 epochs; full-model fine-tuning
|
- **Precision**: bfloat16 |
|
|
|
|
|
### Hardware & Software |
|
- **GPUs**: 32 × NVIDIA GH200 (for the whole model series training)

- **Orchestration**: Hugging Face Trainer

- **Neural networks**: PyTorch
|
|
|
## Citation |
|
|
|
arXiv: https://arxiv.org/abs/2503.18712
|
|
|
```bibtex |
|
@article{YeQi2025llavaction, |
|
title={LLaVAction: evaluating and training multi-modal large language models for action recognition}, |
|
author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.}, |
|
journal={arXiv preprint arXiv:2503.18712},
|
year={2025} |
|
} |
|
``` |