---
license: apache-2.0
tags:
- diffusion
- image-to-image
- depth-estimation
- optical-flow
- amodal-segmentation
---

# Scaling Properties of Diffusion Models for Perceptual Tasks

### CVPR 2025

**Rahul Ravishankar\*, Zeeshan Patel\*, Jathushan Rajasegaran, Jitendra Malik**

[[Paper](https://arxiv.org/abs/2411.08034)] · [[Project Page](https://scaling-diffusion-perception.github.io/)]

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute on these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes for scaling diffusion models on visual perception tasks. Our models achieve performance competitive with state-of-the-art methods while using significantly less data and compute.
## Getting started

You can download our DiT-MoE Generalist model [here](https://huggingface.co/zeeshanp/scaling_diffusion_perception/blob/main/dit_moe_generalist.pt). Please see the [GitHub README](https://github.com/scaling-diffusion-perception/scaling-diffusion-perception) for instructions on how to use the model.
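As a minimal sketch, you can fetch the checkpoint programmatically with `huggingface_hub` and inspect it with `torch`. This only assumes the file is a standard PyTorch checkpoint; the DiT-MoE model definition and the actual inference pipeline live in the GitHub repository linked above.

```python
# Minimal sketch: download the DiT-MoE Generalist checkpoint and inspect it.
# The model class and inference code are provided in the GitHub repository;
# this snippet only retrieves the weights file.
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="zeeshanp/scaling_diffusion_perception",
    filename="dit_moe_generalist.pt",
)

# Depending on your torch version and the checkpoint contents, you may need
# to pass weights_only=False if the file stores more than raw tensors.
state = torch.load(ckpt_path, map_location="cpu")
print(type(state))
```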