|
# VideoCLIP and VLM |
|
|
|
You just found a toolkit for multimodal video understanding! It contains implementations of two recent multimodal video understanding papers, [VideoCLIP](https://arxiv.org/pdf/2109.14084.pdf) (EMNLP, 2021) and [VLM](https://aclanthology.org/2021.findings-acl.370.pdf) (ACL Findings, 2021), along with high-performance toolkits that are typically lacking in existing codebases. The toolkit is designed around generic, performance-tuned components that can potentially be adapted to other frameworks (we initially use fairseq).
|
|
|
VideoCLIP is a contrastive learning model for zero-shot transfer to retrieval/classification/sequence labeling style tasks. |
|
|
|
<img src="videoclip.png" width="350" class="center"> |
|
|
|
VLM is a masked-language-model-style pre-training approach that uses a single encoder with a masked modality model (MMM) for retrieval/generation/sequence labeling style tasks.
|
|
|
<img src="vlm.png" width="350" class="center"> |
|
|
|
### News |
|
[Oct. 2021] Initial release of implementation for the following papers: |
|
[VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding](https://arxiv.org/pdf/2109.14084.pdf) (Xu et al., EMNLP 2021)
|
[VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding](https://aclanthology.org/2021.findings-acl.370.pdf) (Xu et al., ACL Findings 2021)
|
|
|
|
|
### Installation |
|
We aim to minimize the dependency of this repo on other packages. |
|
We use fairseq as the main trainer (models/datasets have no dependency on fairseq; we will support other trainers in the future):
|
``` |
|
git clone https://github.com/pytorch/fairseq |
|
cd fairseq |
|
pip install -e . # also optionally follow fairseq README for apex installation for fp16 training. |
|
export MKL_THREADING_LAYER=GNU # fairseq may need this for numpy. |
|
``` |
|
|
|
Then install this toolkit: |
|
``` |
|
cd examples/MMPT # MMPT can be in any folder, not necessarily under fairseq/examples. |
|
pip install -e . |
|
``` |
|
|
|
The code was developed under Python 3.8.8, PyTorch 1.8, CUDA 11.0 with fairseq 1.0.0a0+af0389f, and tested under Python 3.8.8, PyTorch 1.9, CUDA 11.0, fairseq 1.0.0a0+8e7bc73 at the time of release.
|
Most models require `transformers==3.4` for API compatibility: `pip install transformers==3.4`.
|
In addition, some downstream tasks may need `conda install pandas`. |
|
|
|
|
|
### Usage |
|
#### Download Checkpoints |
|
We use a pre-trained [S3D](https://github.com/antoine77340/S3D_HowTo100M) model for video feature extraction. Please place the model files at `pretrained_models/s3d_dict.npy` and `pretrained_models/s3d_howto100m.pth`.
|
|
|
Download the VideoCLIP checkpoint `https://dl.fbaipublicfiles.com/MMPT/retri/videoclip/checkpoint_best.pt` to `runs/retri/videoclip`, or the VLM checkpoint `https://dl.fbaipublicfiles.com/MMPT/mtm/vlm/checkpoint_best.pt` to `runs/mtm/vlm`.
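
If you prefer scripting the downloads, here is a small convenience sketch (not part of the toolkit) that places the two checkpoints at the paths expected above; plain `wget`/`curl` to the same URLs works just as well.

```python
# Hedged convenience sketch: download the released checkpoints to the paths
# the configs expect. The URLs and target paths are the ones listed above.
import os
import urllib.request

checkpoints = {
    "runs/retri/videoclip/checkpoint_best.pt":
        "https://dl.fbaipublicfiles.com/MMPT/retri/videoclip/checkpoint_best.pt",
    "runs/mtm/vlm/checkpoint_best.pt":
        "https://dl.fbaipublicfiles.com/MMPT/mtm/vlm/checkpoint_best.pt",
}

for path, url in checkpoints.items():
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
```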
|
|
|
#### Demo of Inference |
|
Run `python locallaunch.py projects/retri/videoclip.yaml --dryrun` to generate all `.yaml` configs for VideoCLIP.
|
|
|
```python |
|
import torch |
|
|
|
from mmpt.models import MMPTModel |
|
|
|
|
|
model, tokenizer, aligner = MMPTModel.from_pretrained( |
|
"projects/retri/videoclip/how2.yaml") |
|
|
|
model.eval() |
|
|
|
|
|
# B, T, FPS, H, W, C (VideoCLIP is trained on 30-fps S3D features)
|
video_frames = torch.randn(1, 2, 30, 224, 224, 3) |
|
caps, cmasks = aligner._build_text_seq( |
|
tokenizer("some text", add_special_tokens=False)["input_ids"] |
|
) |
|
|
|
caps, cmasks = caps[None, :], cmasks[None, :] # bsz=1 |
|
|
|
with torch.no_grad(): |
|
output = model(video_frames, caps, cmasks, return_score=True) |
|
print(output["score"]) # dot-product |
|
``` |
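
The demo above feeds random tensors; below is a minimal sketch (not part of the toolkit) of packing real decoded frames into the `(B, T, FPS, H, W, C)` layout the model expects. It assumes you already have 224×224 RGB frames at 30 fps as a numpy array; decoding and resizing (e.g., with ffmpeg or OpenCV) are up to you.

```python
# Minimal sketch: pack decoded RGB frames into (B, T, FPS, H, W, C) for the
# demo above. `frames` stands in for your own decoded video at 30 fps.
import numpy as np
import torch

fps = 30
frames = np.random.rand(90, 224, 224, 3).astype("float32")  # placeholder: (num_frames, H, W, C)

num_seconds = frames.shape[0] // fps          # drop any trailing partial second
frames = frames[: num_seconds * fps]
video_frames = torch.from_numpy(frames).reshape(1, num_seconds, fps, 224, 224, 3)  # B=1
```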
|
|
|
#### Data Preparation |
|
See [dataset](DATASET.md) for each dataset. |
|
|
|
#### Global Config for Training Pipeline |
|
We organize a global config file for each training/testing pipeline under `projects` (see a detailed [explanation](CONFIG.md)). For example, VideoCLIP is in `projects/retri/videoclip.yaml` and VLM is in `projects/mtm/vlm.yaml`.
|
|
|
We wrap all commands in `locallaunch.py` and `mmpt_cli/localjob.py`. You can check the concrete commands with `--dryrun` and then drop that flag for an actual run.
|
|
|
First, running `python locallaunch.py projects/retri/videoclip.yaml --dryrun` generates the configs for pre-training, zero-shot evaluation, fine-tuning, and testing of VideoCLIP under `projects/retri/videoclip`.
|
|
|
Each (training or evaluation) process is then configured by a concrete config file (we save all complex arguments, including fairseq args, into the concrete config file for reproducibility). For example, to run zero-shot evaluation, fine-tuning, and testing on YouCook2:
|
``` |
|
python locallaunch.py projects/retri/videoclip/test_youcook_zs.yaml --jobtype local_predict # zero-shot evaluation. |
|
python locallaunch.py projects/retri/videoclip/youcook_videoclip.yaml --jobtype local_small --dryrun # fine-tuning: use --dryrun to check the commands and drop it to make an actual run; local_small runs on two GPUs (as in the paper).
|
python locallaunch.py projects/retri/videoclip/test_youcook_videoclip.yaml --jobtype local_predict # testing on fine-tuned model. |
|
``` |
|
|
|
Pretraining can be run as: |
|
``` |
|
python locallaunch.py projects/retri/videoclip/how2.yaml --jobtype local_single --dryrun # check the commands, then drop --dryrun; the paper was run with local_big (8 GPUs).
|
``` |
|
You may need to change `--jobtype`; check/extend `LocalJob` in `mmpt_cli/localjob.py` for multi-GPU/multi-node pre-training.
|
|
|
Detailed instructions for pre-training and fine-tuning can be found in the [pretraining instruction](pretraining.md) and [finetuning instruction](endtask.md).
|
|
|
|
|
### Development |
|
Several components of this toolkit can be reused in future research (as well as in our ongoing research).
|
|
|
#### Framework Wrapper |
|
We currently only support fairseq, but most components can easily be adapted to other frameworks such as HuggingFace. This repo is a `--user-dir` of fairseq with fairseq wrappers. For example, `mmpt/tasks` includes a `FairseqMMTTask`, which wraps `mmpt/datasets` with `FairseqDataset`, `mmpt/models` with `FairseqModel`, and `mmpt/losses` with `FairseqCriterion`.
|
|
|
#### Processors |
|
Multimodal research introduces complexity in aligning modalities, from different input sources all the way to the losses. Inspired by [MMF](https://github.com/facebookresearch/mmf), this toolkit leverages `mmpt/processors` to handle the various needs of data preprocessing and loading, alleviating the need for multiple `torch.utils.data.Dataset` classes (which can make ablation studies tricky).
|
Processors can also be decoupled from `torch.utils.data.Dataset` for offline preprocessing instead of on-the-fly data preprocessing.
|
|
|
We decouple `mmpt.MMDataset` into three types of processors, `MetaProcessor`, `VideoProcessor`, and `TextProcessor`, plus an `Aligner`. They can be configured in the `dataset` field of a config file (e.g., see `projects/task/how2.yaml`).
|
`MetaProcessor` loads the metadata of a dataset, e.g., all video IDs of the How2 dataset.
|
`VideoProcessor` loads the video features of a dataset, e.g., S3D features for each second of a video.
|
`TextProcessor` loads the text (features), e.g., BERT pre-tokenized text clips for the How2 dataset (with `start` and `end` timestamps and `cap` for `token_ids`).
|
`Aligner` is the core class that prepares the training data for the different baselines, e.g., sampling a clip, masking tokens for MLM, etc.
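
As a rough illustration of how these pieces fit together (hypothetical toy classes, not the toolkit's actual signatures), a dataset built from these processors might look like this:

```python
# Toy sketch of the MetaProcessor/VideoProcessor/TextProcessor/Aligner split;
# names and return values are illustrative only.
import torch
from torch.utils.data import Dataset

class ToyMetaProcessor:
    """Knows which samples exist, e.g., all How2 video ids."""
    def __init__(self):
        self.video_ids = ["vid_0", "vid_1"]
    def __len__(self):
        return len(self.video_ids)
    def __getitem__(self, idx):
        return self.video_ids[idx]

class ToyVideoProcessor:
    """Loads per-second video features for one video id."""
    def __call__(self, video_id):
        return torch.zeros(32, 512)  # e.g., 32 seconds of 512-d S3D features

class ToyTextProcessor:
    """Loads pre-tokenized text clips for one video id."""
    def __call__(self, video_id):
        return {"start": [0], "end": [4], "cap": [[101, 2023, 102]]}

class ToyAligner:
    """Turns raw features into a training example (clip sampling, masking, ...)."""
    def __call__(self, video_id, vfeature, tfeature):
        return {"vfeats": vfeature[:4], "caps": torch.tensor(tfeature["cap"][0])}

class ToyMMDataset(Dataset):
    def __init__(self, meta, video, text, aligner):
        self.meta, self.video, self.text, self.aligner = meta, video, text, aligner
    def __len__(self):
        return len(self.meta)
    def __getitem__(self, idx):
        video_id = self.meta[idx]
        return self.aligner(video_id, self.video(video_id), self.text(video_id))
```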
|
|
|
#### Performance-tuned Components |
|
To speed up pre-training, this toolkit stores sharded features in memory-mapped numpy arrays, backed by `ShardedTensor` in `mmpt/utils/shardedtensor.py` (adopted from the MARGE paper). This reduces the IO load for multi-GPU training by avoiding loading all of a video's features into memory each time, and `ShardedTensor` ensures features are stored in contiguous disk space for near-random access. This is used for both How2 video features and texts in `mmpt/processors/how2processor.py`.
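
The core idea can be illustrated with plain numpy memory mapping (the file name and shapes below are made up; the toolkit's own `ShardedTensor` additionally handles sharding and indexing for you):

```python
# Illustration of the memory-mapping idea behind ShardedTensor: write features
# once as a contiguous .npy shard, then memory-map it at training time and
# read only the rows a given video needs.
import numpy as np

# offline: store per-second features for many videos contiguously in one shard
shard = np.random.rand(10_000, 512).astype("float32")
np.save("shard_0.npy", shard)

# training time: memory-map the shard; only the sliced rows are read from disk
mmapped = np.load("shard_0.npy", mmap_mode="r")
one_video = np.asarray(mmapped[128:160])  # e.g., rows belonging to one video
```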
|
|
|
|
|
### Citation |
|
If this codebase is useful for your work, please cite the following papers: |
|
|
|
```BibTeX |
|
@inproceedings{xu-etal-2021-videoclip, |
|
    title = "{VideoCLIP}: Contrastive Pre-training for Zero-shot Video-Text Understanding",
|
author = "Xu, Hu and |
|
Ghosh, Gargi and |
|
Huang, Po-Yao and |
|
Okhonko, Dmytro and |
|
Aghajanyan, Armen and |
|
Metze, Florian and |
|
Zettlemoyer, Luke and |
|
Feichtenhofer, Christoph", |
|
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)", |
|
month = nov, |
|
year = "2021", |
|
address = "Online", |
|
publisher = "Association for Computational Linguistics", |
|
} |
|
|
|
@inproceedings{xu-etal-2021-vlm, |
|
title = "{VLM}: Task-agnostic Video-Language Model Pre-training for Video Understanding", |
|
author = "Xu, Hu and |
|
Ghosh, Gargi and |
|
Huang, Po-Yao and |
|
Arora, Prahal and |
|
Aminzadeh, Masoumeh and |
|
Feichtenhofer, Christoph and |
|
Metze, Florian and |
|
Zettlemoyer, Luke", |
|
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021", |
|
month = aug, |
|
year = "2021", |
|
address = "Online", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://aclanthology.org/2021.findings-acl.370", |
|
doi = "10.18653/v1/2021.findings-acl.370", |
|
pages = "4227--4239", |
|
} |
|
``` |
|
|
|
### Bug Reports |
|
This repo is in its initial stage; bug reports are welcome at huxu@fb.com.
|
|
|
### Copyright |
|
The majority of Multimodal Pre-training (MMPT) is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: evaluation codes/models for HowTo100M and HuggingFace Transformers are licensed under the Apache 2.0 license; COIN and NLG-eval are licensed under the MIT license; CrossTask is licensed under BSD-3; DiDeMo is licensed under the BSD-2 license.
|
|