# SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Official PyTorch implementation of SMILE (CVPR 2025).

Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem

King Abdullah University of Science and Technology (KAUST)
## 📰 News
- [2025.6.2] Code and pre-trained models are available now!
- [2025.5.28] Code and pre-trained models will be released here. Watch this repository for the latest updates.
## ✨ Highlights
### 🔥 State-of-the-art on SSv2 and K400
Our method achieves state-of-the-art performance on SSv2 and K400 benchmarks with a ViT-B backbone, surpassing prior self-supervised video models by up to 2.5%, thanks to efficient CLIP-based semantic supervision.
### ⚡️ Leading Results Across Generalization Challenges
We evaluate our method on the SEVERE benchmark, covering domain shift, low-shot learning, fine-grained actions, and task adaptability. Our model consistently outperforms prior methods and achieves a 3.0% average gain over strong baselines, demonstrating superior generalization in diverse video understanding tasks.
### 🔮 Superior Motion Representation Without Video-Text Alignment
Compared to CLIP-based methods such as ViCLIP and UMT, our model achieves higher accuracy on motion-sensitive datasets, particularly under linear probing. This indicates stronger video representations learned with less data and without relying on video-text alignment.
## 📊 Main Results and Models
### ✨ Something-Something V2
| Method | Pretrain Dataset | Pretrain Epochs | Backbone | Top-1 Acc. (%) | Finetune |
| --- | --- | --- | --- | --- | --- |
| SMILE | K400 | 800 | ViT-S | 69.1 | TODO |
| SMILE | K400 | 600 | ViT-B | 72.1 | log / checkpoint |
| SMILE | K400 | 1200 | ViT-B | 72.4 | log / checkpoint |
| SMILE | SSv2 | 800 | ViT-B | 72.5 | TODO |
### ✨ Kinetics-400
| Method | Pretrain Dataset | Pretrain Epochs | Backbone | Top-1 Acc. (%) | Pretrain | Finetune |
| --- | --- | --- | --- | --- | --- | --- |
| SMILE | K400 | 800 | ViT-S | 79.5 | TODO | TODO |
| SMILE | K400 | 600 | ViT-B | 83.1 | checkpoint | log / checkpoint |
| SMILE | K400 | 1200 | ViT-B | 83.4 | checkpoint | log / checkpoint |
## 🔨 Installation
Please follow the instructions in INSTALL.md.
## ⚡️ Data Preparation
We follow the VideoMAE data preparation pipeline to prepare our datasets (K400 and SSv2). We provide our annotation files for these two datasets here: annotation_files. For pretraining, we use the training sets (train.csv).
We provide the list of segmented object images used for pretraining in object_instances.txt. The images will be released later.
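Since we follow the VideoMAE pipeline, each annotation line is expected to name a video and its label. Below is a minimal sanity-check sketch, assuming the space-separated `<video_path> <label>` layout used by VideoMAE-style csv files; adjust the parsing if your annotation files differ:

```python
# Minimal sketch: sanity-check a VideoMAE-style annotation file before training.
# Assumes each line of train.csv is "<video_path> <label>" (space-separated);
# this layout is an assumption carried over from the VideoMAE data preparation.
import os
import sys

def check_annotations(csv_path: str) -> None:
    missing = 0
    with open(csv_path) as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.strip().rsplit(" ", 1)
            if len(parts) != 2 or not parts[1].isdigit():
                print(f"line {lineno}: malformed entry: {line.strip()!r}")
                continue
            video_path, _label = parts
            if not os.path.exists(video_path):
                missing += 1
                if missing <= 10:  # avoid flooding the console
                    print(f"line {lineno}: missing video: {video_path}")
    print(f"done: {missing} missing videos")

if __name__ == "__main__":
    check_annotations(sys.argv[1])  # e.g. python check_annotations.py train.csv
```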
## 🚀 Pre-training
Following the VideoMAE pre-training guide, we provide scripts for pre-training on the Kinetics-400 (K400) dataset using the ViT-Base model: scripts/pretrain/
As described in the paper, we adopt a two-stage training strategy. Please refer to the script names to identify which stage to run.
If you wish to perform your own pre-training, make sure to update the following parameters in the scripts (see the checkpoint-inspection sketch below):

- `DATA_PATH`: Path to your dataset
- `OUTPUT_DIR`: Directory to save output results
- `OBJECTS_PATH`: Path to the overlaying objects image dataset (image data to be released)
- `FIRST_STAGE_CKPT`: Path to the checkpoint from first-stage pre-training (used for second-stage training)
Note: Our pre-training experiments were conducted using 8 V100 (32 GB) GPUs.
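The second-stage scripts initialize the model from `FIRST_STAGE_CKPT`. If a load ever fails, it helps to inspect the checkpoint first; below is a minimal sketch with plain PyTorch, assuming the VideoMAE-style convention of saving `{"model": state_dict, ...}` (the path is a placeholder for your own checkpoint):

```python
# Minimal sketch: inspect a first-stage checkpoint before second-stage training.
# Assumes the VideoMAE-style convention of wrapping weights in a "model" key;
# the path below is a placeholder for your own FIRST_STAGE_CKPT.
import torch

ckpt = torch.load("output/first_stage/checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # unwrap if wrapped in a "model" key

print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))  # e.g. patch-embed / transformer blocks
```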
## ⤴️ Fine-tuning with Pre-trained Models
Following the VideoMAE fine-tuning guide, we provide scripts for fine-tuning on the Something-Something V2 (SSv2) and Kinetics-400 (K400) datasets using the ViT-Base model: scripts/finetune/
To perform your own fine-tuning, please update the following parameters in the script (see the input-shape sketch below):

- `DATA_PATH`: Path to your dataset
- `MODEL_PATH`: Path to the pre-trained model
- `OUTPUT_DIR`: Directory to save output results
Note: Our fine-tuning experiments were conducted using 4 V100 (32 GB) GPUs.
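Before launching a full run, it can be useful to confirm your data pipeline produces clips of the expected shape. A minimal sketch follows, assuming the standard VideoMAE ViT-Base setup of 16 frames at 224x224 with ImageNet normalization (these values are assumptions, not read from the scripts):

```python
# Minimal sketch: verify a fine-tuning input clip's shape and normalization.
# The 16x224x224 geometry and ImageNet mean/std follow the common
# VideoMAE-style setup and are assumptions, not values read from this repo.
import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1, 1)

clip = torch.rand(1, 3, 16, 224, 224)  # (B, C, T, H, W), values in [0, 1]
clip = (clip - mean) / std             # per-channel ImageNet normalization
print(clip.shape)                      # torch.Size([1, 3, 16, 224, 224])
```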
## ✉️ Contact
Fida Mohammad Thoker: fida.thoker@kaust.edu.sa
## 🙏 Acknowledgements
We sincerely thank Michael Dorkenwald for providing the object image dataset that supports this work.
This project is built upon VideoMAE and tubelet-contrast. Thanks to the contributors of these great codebases.
## 🔒 License
This project is released under the MIT license. For more details, please refer to the LICENSE file.
## ✏️ Citation
If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:
```bibtex
@inproceedings{thoker2025smile,
  author    = {Thoker, Fida Mohammad and Jiang, Letian and Zhao, Chen and Ghanem, Bernard},
  title     = {SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning},
  booktitle = {CVPR},
  year      = {2025},
}
```