|
--- |
|
license: cc-by-nc-sa-4.0 |
|
datasets: |
|
- lmms-lab/LLaVA-Video-178K |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- lmms-lab/LLaVA-Video-7B-Qwen2 |
|
pipeline_tag: video-text-to-text |
|
library_name: transformers |
|
tags: |
|
- Action |
|
- Video |
|
- MQA |
|
- multimodal |
|
- VLM |
|
- LLaVAction |
|
- MLLMs |
|
model-index: |
|
- name: LLaVAction-7B |
|
results: |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: EgoSchema |
|
type: egoschema |
|
metrics: |
|
- type: accuracy |
|
value: 59 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MVBench |
|
type: mvbench |
|
metrics: |
|
- type: accuracy |
|
value: 61.1 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: NextQA |
|
type: nextqa |
|
metrics: |
|
- type: accuracy |
|
value: 82.8 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: PercepTest |
|
      type: perceptest
|
metrics: |
|
- type: accuracy |
|
value: 70.2 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: LongVideoBench |
|
type: longvideobench |
|
metrics: |
|
- type: accuracy |
|
value: 58.6 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: VideoMME |
|
type: videomme |
|
metrics: |
|
- type: accuracy |
|
value: 63.9 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: VideoMME (w-subs) |
|
type: videomme |
|
metrics: |
|
- type: accuracy |
|
value: 71.4 |
|
name: accuracy |
|
verified: true |
|
--- |
|
|
|
# LLaVAction-7B |
|
|
|
<div align="center"> |
|
<h2>LLaVAction: evaluating and training multi-modal large language models for action recognition |
|
</h2> |
|
|
|
[Shaokai Ye](https://yeshaokai.github.io/)<sup>1**</sup> |
|
[Haozhe Qi](https://people.epfl.ch/haozhe.qi)<sup>1**</sup> |
|
|
|
[Alexander Mathis](https://mathislab.org/)<sup>1</sup><sup>†</sup> |
|
[Mackenzie Weygandt Mathis](https://www.mackenziemathislab.org/mackenziemathis)<sup>1</sup><sup>†</sup><sup>‡</sup> |
|
|
|
<sup>1</sup> EPFL |
|
|
|
<sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author |
|
|
|
\[[arXiv Paper](https://arxiv.org/abs/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[Github Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\]
|
|
|
</div> |
|
|
|
## Model Summary |
|
The LLaVAction-7B model is trained on EPIC-KITCHENS-100-MQA and is based on the Qwen2 language model with a context window of 32K tokens.
|
This model supports at most 64 frames. |
|
|
|
- **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/) |
|
- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/2503.18712)
|
- **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction) |
|
- **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis) |
|
- **Languages**: English |
|
|
## Usage
|
|
|
### Intended use |
|
The model was trained on EPIC-KITCHENS-100-MQA [dataset release pending] and [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). It has improved capabilities for understanding human egocentric actions in videos.
|
|
|
|
|
### Generation |
|
We provide a simple generation example below. For more details, please refer to our [GitHub repository](https://github.com/AdaptiveMotorControlLab/LLaVAction).
|
|
|
```python |
|
# Install the package first: pip install llavaction
|
|
|
from llavaction.model.builder import load_pretrained_model |
|
from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token |
|
from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX |
|
from llavaction.conversation import conv_templates, SeparatorStyle |
|
from PIL import Image |
|
import requests |
|
import copy |
|
import torch |
|
import sys |
|
import warnings |
|
from decord import VideoReader, cpu |
|
import numpy as np |
|
warnings.filterwarnings("ignore") |
|
|
|
# Your video (the model assumes an egocentric viewpoint)
|
video_path = "XXXX" |
|
|
|
# These are the prompts we trained with, but you can test others:
|
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?" |
|
task_prompt = "Describe in details what you see from the video frames." |
|
|
|
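# Uniformly sample up to max_frames_num frames; returns the frames, their timestamps (s), and the video duration (s)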
def load_video(video_path, max_frames_num,fps=1,force_sample=False): |
|
if max_frames_num == 0: |
|
return np.zeros((1, 336, 336, 3)) |
|
vr = VideoReader(video_path, ctx=cpu(0),num_threads=1) |
|
total_frame_num = len(vr) |
|
video_time = total_frame_num / vr.get_avg_fps() |
|
fps = round(vr.get_avg_fps()/fps) |
|
frame_idx = [i for i in range(0, len(vr), fps)] |
|
if len(frame_idx) > max_frames_num or force_sample: |
|
sample_fps = max_frames_num |
|
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int) |
|
frame_idx = uniform_sampled_frames.tolist() |
|
frame_time = [i/vr.get_avg_fps() for i in frame_idx] |
|
spare_frames = vr.get_batch(frame_idx).asnumpy() |
|
|
return spare_frames,frame_time,video_time |
|
|
|
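# Load the pretrained LLaVAction-7B checkpoint in bfloat16 with automatic device placement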
pretrained = "MLAdaptiveIntelligence/LLaVAction-7B" |
|
model_name = "llava_qwen" |
|
device = "cuda" |
|
device_map = "auto" |
|
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  # pass any additional llava_model_args here
|
model.eval() |
|
max_frames_num = 64 |
|
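# Sample 64 frames uniformly from the video and preprocess them for the vision tower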
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True) |
|
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16) |
|
video = [video] |
|
conv_template = "qwen_1_5"  # make sure to use the correct chat template for the model
|
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. " |
|
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}" |
|
|
|
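# Build the chat prompt from the conversation template, with the image token and prompts as the user turn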
conv = copy.deepcopy(conv_templates[conv_template]) |
|
conv.append_message(conv.roles[0], question) |
|
conv.append_message(conv.roles[1], None) |
|
prompt_question = conv.get_prompt() |
|
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device) |
|
|
|
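# Greedy decoding over the sampled video frames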
cont = model.generate( |
|
input_ids, |
|
images=video, |
|
modalities= ["video"], |
|
do_sample=False, |
|
temperature=0, |
|
max_new_tokens=4096, |
|
) |
|
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip() |
|
print(text_outputs) |
|
``` |
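
Since the model is also trained on multiple-choice question answering (MQA) over EPIC-KITCHENS-100, the same pipeline can be reused for multiple-choice prompts. The sketch below reuses the `model`, `tokenizer`, `video`, and prompt variables from the example above; the option list and instruction wording are illustrative assumptions, not the exact EPIC-KITCHENS-100-MQA template.

```python
# Minimal MQA-style sketch reusing the objects loaded above.
# The options below are hypothetical examples, not the training template.
mc_question = (
    DEFAULT_IMAGE_TOKEN
    + f"\n{time_instruction}\n{perspective_prompt} "
    + "Select the single best answer.\n"
    + "A. cut the onion\nB. wash the pan\nC. open the fridge\nD. stir the pot\n"
    + "Answer with the letter only."
)

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], mc_question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

with torch.inference_mode():
    out = model.generate(
        input_ids,
        images=video,
        modalities=["video"],
        do_sample=False,
        max_new_tokens=16,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0].strip())
```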
|
|
|
|
|
## Training |
|
|
|
See details in Ye et al. 2025: https://arxiv.org/abs/2503.18712
|
|
|
### Model |
|
- **Architecture**: SO400M vision encoder + Qwen2 language model
|
- **Initialized Model**: lmms-lab/LLaVA-Video-7B-Qwen2 |
|
- **Data**: a mixture of LLaVA-Video-178K and EPIC-KITCHENS-100-MQA; 2 epochs; full-model fine-tuning
|
- **Precision**: bfloat16 |
|
|
|
|
|
### Hardware & Software |
|
- **GPUs**: 32 × NVIDIA GH200 (for the whole model series training)

- **Orchestration**: Hugging Face Trainer

- **Neural networks**: PyTorch
|
|
|
## Citation |
|
|
|
arXiv: https://arxiv.org/abs/2503.18712
|
|
|
```bibtex |
|
@article{YeQi2025llavaction, |
|
title={LLaVAction: evaluating and training multi-modal large language models for action recognition}, |
|
author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.}, |
|
journal={arXiv preprint arXiv:2503.18712},
|
year={2025} |
|
} |
|
``` |