nielsr HF Staff committed on
Commit 74c82f8 · verified · 1 Parent(s): 563df15

Add model card metadata

This PR adds missing metadata to the model card, including the pipeline tag, library name, and license. This improves discoverability and clarity for users.

Files changed (1)
  1. README.md +210 -0
README.md ADDED
@@ -0,0 +1,210 @@
---
pipeline_tag: video-to-video
library_name: diffusers
license: mit
---

# 🎥 FAR: Frame Autoregressive Model for Both Short- and Long-Context Video Modeling 🚀

<div align="center">

[![Project Page](https://img.shields.io/badge/Project-Website-orange)](https://farlongctx.github.io/)
[![arXiv](https://img.shields.io/badge/arXiv-2503.19325-b31b1b.svg)](https://arxiv.org/abs/2503.19325)&nbsp;
[![huggingface weights](https://img.shields.io/badge/%F0%9F%A4%97%20Weights-FAR-yellow)](https://huggingface.co/guyuchao/FAR_Models)&nbsp;
[![SOTA](https://img.shields.io/badge/State%20of%20the%20Art-Video%20Generation%20-32B1B4?logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyB3aWR0aD0iNjA2IiBoZWlnaHQ9IjYwNiIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxuczp4bGluaz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94bGluayIgb3ZlcmZsb3c9ImhpZGRlbiI%2BPGRlZnM%2BPGNsaXBQYXRoIGlkPSJjbGlwMCI%2BPHJlY3QgeD0iLTEiIHk9Ii0xIiB3aWR0aD0iNjA2IiBoZWlnaHQ9IjYwNiIvPjwvY2xpcFBhdGg%2BPC9kZWZzPjxnIGNsaXAtcGF0aD0idXJsKCNjbGlwMCkiIHRyYW5zbGF0ZSgxIDEpIj48cmVjdCB4PSI1MjkiIHk9IjY2IiB3aWR0aD0iNTYiIGhlaWdodD0iNDczIiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iMTkiIHk9IjY2IiB3aWR0aD0iNTciIGhlaWdodD0iNDczIiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iMjc0IiB5PSIxNTEiIHdpZHRoPSI1NyIgaGVpZ2h0PSIzMDIiIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSIxMDQiIHk9IjE1MSIgd2lkdGg9IjU3IiBoZWlnaHQ9IjMwMiIgZmlsbD0iIzQ0RjJGNiIvPjxyZWN0IHg9IjQ0NCIgeT0iMTUxIiB3aWR0aD0iNTciIGhlaWdodD0iMzAyIiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iMzU5IiB5PSIxNzAiIHdpZHRoPSI1NiIgaGVpZ2h0PSIyNjQiIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSIxODgiIHk9IjE3MCIgd2lkdGg9IjU3IiBoZWlnaHQ9IjI2NCIgZmlsbD0iIzQ0RjJGNiIvPjxyZWN0IHg9Ijc2IiB5PSI2NiIgd2lkdGg9IjQ3IiBoZWlnaHQ9IjU3IiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iNDgyIiB5PSI2NiIgd2lkdGg9IjQ3IiBoZWlnaHQ9IjU3IiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iNzYiIHk9IjQ4MiIgd2lkdGg9IjQ3IiBoZWlnaHQ9IjU3IiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iNDgyIiB5PSI0ODIiIHdpZHRoPSI0NyIgaGVpZ2h0PSI1NyIgZmlsbD0iIzQ0RjJGNiIvPjwvZz48L3N2Zz4%3D)](https://paperswithcode.com/sota/video-generation-on-ucf-101)

</div>

<p align="center" style="font-size: larger;">
<a href="https://arxiv.org/abs/2503.19325">Long-Context Autoregressive Video Modeling with Next-Frame Prediction</a>
</p>

![dmlab_sample](./assets/dmlab_sample.png)

## 📢 News

* **2025-03:** Paper and code for [FAR](https://farlongctx.github.io/) are released! 🎉

## 🌟 What's the Potential of FAR?

### 🔥 Introducing FAR: a new baseline for autoregressive video generation

FAR (<u>**F**</u>rame <u>**A**</u>uto<u>**R**</u>egressive Model) learns to predict continuous frames based on an autoregressive context. Its objective aligns naturally with video modeling, mirroring next-token prediction in language modeling.

![pipeline](./assets/pipeline.png)
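
In code terms, the objective can be caricatured as a next-frame prediction loop. The sketch below is conceptual only: the toy mean predictor and plain per-frame MSE are our stand-ins for FAR's actual architecture and loss, chosen just to show the "condition on frames < t, predict frame t" structure.

```python
import numpy as np

# Conceptual sketch of a frame-autoregressive objective (NOT the actual FAR
# code): each frame is predicted from all previous frames, mirroring
# next-token prediction in language modeling.

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))  # toy clip: 8 frames, 16-dim latent each

def predict_next_frame(context):
    # Hypothetical stand-in predictor: the mean of the context frames.
    return context.mean(axis=0)

losses = []
for t in range(1, len(video)):
    pred = predict_next_frame(video[:t])                  # condition on frames < t
    losses.append(float(((pred - video[t]) ** 2).mean())) # toy per-frame MSE

loss = sum(losses) / len(losses)
```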

### 🔥 FAR achieves better convergence than video diffusion models with the same continuous latent space

<p align="center">
<img src="./assets/converenge.jpg" width=55%>
</p>

### 🔥 FAR leverages clean visual context without additional image-to-video fine-tuning

Unconditional pretraining on UCF-101 achieves state-of-the-art results in both video generation (context frames = 0) and video prediction (context frames ≥ 1) within a single model.

<p align="center">
<img src="./assets/performance.png" width=75%>
</p>

### 🔥 FAR supports 16x longer temporal extrapolation at test time

<p align="center">
<img src="./assets/extrapolation.png" width=100%>
</p>

### 🔥 FAR supports efficient training on long video sequences with manageable token lengths

<p align="center">
<img src="./assets/long_short_term_ctx.jpg" width=55%>
</p>

#### 📚 For more details, check out our [paper](https://arxiv.org/abs/2503.19325).

## 🏋️‍♂️ FAR Model Zoo

We provide the FAR models trained in our paper for reproduction.

### Video Generation

We use seeds [0, 2, 4, 6] for evaluation, following the evaluation protocol of [Latte](https://arxiv.org/abs/2401.03048):

| Model (Config) | #Params | Resolution | Condition | FVD | HF Weights | Pre-Computed Samples |
|:-------:|:------------:|:------------:|:-----------:|:-----:|:----------:|:----------:|
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_uncond_res128_400K_bs32.yml) | 457 M | 128x128 | ✗ | 280 ± 11.7 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Uncond128-c19abd2c.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_cond_res128_400K_bs32.yml) | 457 M | 128x128 | ✓ | 99 ± 5.9 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Cond128-c6f798bf.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_uncond_res256_400K_bs32.yml) | 457 M | 256x256 | ✗ | 303 ± 13.5 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Uncond256-adea51e9.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-L](options/train/far/video_generation/FAR_L_ucf101_cond_res256_400K_bs32.yml) | 457 M | 256x256 | ✓ | 113 ± 3.6 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_L_UCF101_Cond256-41c6033f.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-XL](options/train/far/video_generation/FAR_XL_ucf101_uncond_res256_400K_bs32.yml) | 657 M | 256x256 | ✗ | 279 ± 9.2 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_XL_UCF101_Uncond256-3594ce6b.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-XL](options/train/far/video_generation/FAR_XL_ucf101_cond_res256_400K_bs32.yml) | 657 M | 256x256 | ✓ | 108 ± 4.2 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/video_generation/FAR_XL_UCF101_Cond256-28a88f56.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |

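The FVD entries above are reported as mean ± spread over the four evaluation seeds. As a small illustration of how such an entry is formed (the per-seed scores below are made up, and using the sample standard deviation is our assumption about the exact aggregation):

```python
import statistics

# Hypothetical per-seed FVD scores (illustrative numbers, not real results)
fvd_per_seed = {0: 270.0, 2: 285.0, 4: 292.0, 6: 273.0}

scores = list(fvd_per_seed.values())
mean = statistics.mean(scores)   # 280.0
std = statistics.stdev(scores)   # sample standard deviation across seeds
print(f"FVD: {mean:.0f} ± {std:.1f}")
```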
### Short-Video Prediction

We follow the evaluation protocol of [MCVD](https://arxiv.org/abs/2205.09853) and [ExtDM](https://openaccess.thecvf.com/content/CVPR2024/papers/Zhang_ExtDM_Distribution_Extrapolation_Diffusion_Model_for_Video_Prediction_CVPR_2024_paper.pdf):

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples |
|:-----:|:------------:|:------------:|:-----:|:-----:|:-----:|:-----:|:----------:|:----------:|
| [FAR-B](options/train/far/short_video_prediction/FAR_B_ucf101_res64_200K_bs32.yml) | 130 M | UCF101 | 25.64 | 0.818 | 0.037 | 194.1 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/short_video_prediction/FAR_B_UCF101_Uncond64-381d295f.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-B](options/train/far/short_video_prediction/FAR_B_bair_res64_200K_bs32.yml) | 130 M | BAIR (c=2, p=28) | 19.40 | 0.819 | 0.049 | 144.3 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/short_video_prediction/FAR_B_BAIR_Uncond64-1983191b.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |

### Long-Video Prediction

We use seeds [0, 2, 4, 6] for evaluation, following the evaluation protocol of [TECO](https://arxiv.org/abs/2210.02396):

| Model (Config) | #Params | Dataset | PSNR | SSIM | LPIPS | FVD | HF Weights | Pre-Computed Samples |
|:-----:|:------------:|:------------:|:-----:|:-----:|:-----:|:-----:|:----------:|:----------:|
| [FAR-B-Long](options/train/far/long_video_prediction/FAR_B_Long_dmlab_res64_400K_bs32.yml) | 150 M | DMLab | 22.3 | 0.687 | 0.104 | 64 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/long_video_prediction/FAR_B_Long_DMLab_Action64-c09441dc.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |
| [FAR-M-Long](options/train/far/long_video_prediction/FAR_M_Long_minecraft_res128_400K_bs32.yml) | 280 M | Minecraft | 16.9 | 0.448 | 0.251 | 39 | [Model-HF](https://huggingface.co/guyuchao/FAR_Models/resolve/main/long_video_prediction/FAR_M_Long_Minecraft_Action128-4c041561.pth) | [Google Drive](https://drive.google.com/drive/folders/1p1MvCiTfoUYAUYNqQNG6nEU02zy8U1vp?usp=drive_link) |

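PSNR in the prediction tables is the standard peak signal-to-noise ratio. As a reminder of the formula (the textbook definition, not code from this repo):

```python
import math

def psnr(mse, max_val=1.0):
    # Peak signal-to-noise ratio in dB for a given mean squared error,
    # assuming pixel values lie in [0, max_val].
    return 10.0 * math.log10(max_val ** 2 / mse)

print(round(psnr(0.01), 1))  # MSE of 0.01 on [0, 1] pixels -> 20.0 dB
```

Higher is better: halving the MSE gains about 3 dB.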
## 🔧 Dependencies and Installation

### 1. Setup Environment:

```bash
# Setup conda environment
conda create -n FAR python=3.10
conda activate FAR

# Install PyTorch
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install other dependencies
pip install -r requirements.txt
```

### 2. Prepare Dataset:

We have uploaded the datasets used in this paper to Hugging Face for faster download. Please follow the instructions below to prepare them.

```python
from huggingface_hub import snapshot_download

dataset_url = {
    "ucf101": "guyuchao/UCF101",
    "bair": "guyuchao/BAIR",
    "minecraft": "guyuchao/Minecraft",
    "minecraft_latent": "guyuchao/Minecraft_Latent",
    "dmlab": "guyuchao/DMLab",
    "dmlab_latent": "guyuchao/DMLab_Latent"
}

for key, url in dataset_url.items():
    snapshot_download(
        repo_id=url,
        repo_type="dataset",
        local_dir=f"datasets/{key}",
        token="input your hf token here"
    )
```

Then, enter each dataset's directory and extract the shards:

```bash
find . -name "shard-*.tar" -exec tar -xvf {} \;
```

### 3. Prepare Pretrained Models of FAR:

We have uploaded the pretrained FAR models to Hugging Face. Please follow the instructions below to download them if you want to evaluate FAR.

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="guyuchao/FAR_Models",
    repo_type="model",
    local_dir="experiments/pretrained_models/FAR_Models",
    token="input your hf token here"
)
```

## 🚀 Training

To train different models, run the following command:

```bash
accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 19040 \
    train.py \
    -opt train_config.yml
```

* **Wandb:** Set `use_wandb` to `True` in the config to enable wandb monitoring.
* **Periodic Evaluation:** Set `val_freq` to control how often evaluation runs during training.
* **Auto Resume:** Simply rerun the script; it will find the latest checkpoint to resume from, and the wandb log will resume automatically.
* **Efficient Training on Pre-Extracted Latents:** Set `use_latent` to `True`, and set `data_list` to the corresponding latent path list.

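These knobs live in the YAML training configs (see the `options/` paths in the Model Zoo tables). A hypothetical fragment showing how they might be set; the key names come from the options above, but the values and the `data_list` path are ours, so check a shipped config for the real schema:

```yml
use_wandb: true            # enable wandb monitoring
val_freq: 5000             # run evaluation every 5k iterations
use_latent: true           # train on pre-extracted latents
data_list: [datasets/dmlab_latent/train.list]   # hypothetical path
```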
## 💻 Sampling & Evaluation

To evaluate a pretrained model, copy its training config and set `pretrain_network: ~` to the path of your trained model folder. Then run the following script:

```bash
accelerate launch \
    --num_processes 8 \
    --num_machines 1 \
    --main_process_port 10410 \
    test.py \
    -opt test_config.yml
```

## 📜 License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## 📖 Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

```bibtex
@article{gu2025long,
  title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
  author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.19325},
  year={2025}
}
```