SD v2-1-base, trained with realigned covariances (DFT-colored noise)

This repository contains a version of Stable Diffusion v2.1 base adapted with realigned covariances: it is trained with colored noise (DFT approximation) instead of white noise. The weights were initialized from the pretrained Stable Diffusion v2.1 base model, and training used a 100,000-sample subset of the Re-LAION-5B research-safe dataset.

This model is intended for academic research use only and is not suitable for production deployment.

Usage

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import _get_model_file  # private diffusers helper; may change between versions
from safetensors.torch import load_file

pretrained_model_name_or_path = "EPFL-IVRL/sd2.1-base-colorednoiseDFT"
pipe = StableDiffusionPipeline.from_pretrained(pretrained_model_name_or_path).to("cuda")

prompt = "An astronaut riding a horse."

# Load the variance spectrum shipped with the model (initial_noise_loader/stats.safetensors).
stats = load_file(_get_model_file(pretrained_model_name_or_path, weights_name="stats.safetensors", subfolder="initial_noise_loader"))
variance_spectrum = stats["variance_spectrum_vae64_dft"].to("cuda")

# Sample white Gaussian noise at the latent resolution (4 channels, 64x64).
generator = torch.manual_seed(123456)
initial_noise = torch.randn((1, 4, 64, 64), generator=generator).to("cuda")

# Color the noise: scale each centered 2-D DFT coefficient by the square root
# of its variance, then transform back and keep the real part.
dft = torch.fft.fftshift(torch.fft.fftn(initial_noise, dim=(-2, -1), norm="ortho"), dim=(-2, -1))
dft *= torch.sqrt(variance_spectrum)
initial_noise = torch.real(torch.fft.ifftn(torch.fft.ifftshift(dft, dim=(-2, -1)), dim=(-2, -1), norm="ortho"))

# Use the colored noise as the initial latents.
pipe(prompt, latents=initial_noise).images[0].show()
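
The noise-coloring step above can be isolated into a small helper to see what it does without downloading the model. The following is a minimal sketch, not the authors' code: `color_noise` is a hypothetical helper name, and `toy_spectrum` is a made-up low-pass spectrum standing in for the real `variance_spectrum_vae64_dft` statistics. It assumes the spectrum is stored centered (matching the `fftshift` in the usage snippet) and that it broadcasts against the latent shape.

```python
import torch


def color_noise(white_noise: torch.Tensor, variance_spectrum: torch.Tensor) -> torch.Tensor:
    """Scale each 2-D DFT coefficient of `white_noise` by sqrt(variance_spectrum)."""
    dims = (-2, -1)
    # Orthonormal 2-D DFT, shifted so the zero frequency sits at the center.
    dft = torch.fft.fftshift(torch.fft.fftn(white_noise, dim=dims, norm="ortho"), dim=dims)
    # The square root of a variance gives the per-frequency standard deviation.
    dft = dft * torch.sqrt(variance_spectrum)
    # Undo the shift, transform back, and keep the real part.
    return torch.fft.ifftn(torch.fft.ifftshift(dft, dim=dims), dim=dims, norm="ortho").real


generator = torch.Generator().manual_seed(0)
white = torch.randn((1, 4, 64, 64), generator=generator)

# Sanity check: a flat (all-ones) spectrum should leave the noise unchanged.
flat_spectrum = torch.ones(64, 64)
unchanged = color_noise(white, flat_spectrum)

# Hypothetical low-pass spectrum on a centered frequency grid.
# This is NOT the model's statistics; it only illustrates the effect.
freqs = torch.fft.fftshift(torch.fft.fftfreq(64))
fy, fx = torch.meshgrid(freqs, freqs, indexing="ij")
toy_spectrum = 1.0 / (1.0 + (fy**2 + fx**2) / 0.1**2)
colored = color_noise(white, toy_spectrum)

print("flat spectrum is identity:", torch.allclose(unchanged, white, atol=1e-5))
print("variance white / colored:", white.var().item(), colored.var().item())
```

With a flat spectrum the helper returns the input noise up to floating-point error, which is a quick way to check that the forward and inverse transforms are matched. Because the toy spectrum never exceeds 1, the colored sample has lower variance than the white one.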

Model Description

Model type: Diffusion-based text-to-image generation model
Language(s): English
License: This model is meant for academic research use only, not for production use. See the EPFL source code academic license. The pretrained model, Stable Diffusion v2.1 base, is licensed under the CreativeML Open RAIL++-M License.
Adapted from model: Stable Diffusion v2.1 base
Resources for more information: Project page, GitHub repository
Cite as: see the Citation section below.

Citation

@article{everaert2024covariancemismatch,
    author   = {Everaert, Martin Nicolas and Süsstrunk, Sabine and Achanta, Radhakrishna},
    title    = {{C}ovariance {M}ismatch in {D}iffusion {M}odels},
    journal  = {Infoscience preprint Infoscience:20.500.14299/242173},
    month    = {November},
    year     = {2024},
}

Training details

Dataset size: 100k image-caption pairs from Re-LAION-5B research-safe
Hardware: 1 × NVIDIA A100-SXM4-80GB
Training time: 9 h 55 min
Pretrained model: Stable Diffusion v2.1 base
Covariance realignment method:
- original data (without data whitening)
- colored noise (DFT approximation)
- no reweighting of components in the loss
Optimizer: AdamW (32-bit, no quantization)
- betas: (0.9, 0.999)
- weight_decay: 0.01
- eps: 1e-08
- lr: Constant 1e-05
Batch size: 32 (no gradient accumulation)
Caption dropout: 10%
Exponential Moving Average (EMA) decay: 0.99
Training steps: 20,000 (intermediate checkpoint at training step 10,000 in the unet_10000 subfolder)
Training range of noise levels:
- Same noise scheduler as Stable Diffusion v2.1 base, i.e. $\mathrm{SNR} \in [0.0047, 1175.4403]$
Training loss:
- Logs
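
The SNR range quoted in the training details can be approximately reproduced from the noise schedule. The sketch below assumes the scheduler settings commonly published for Stable Diffusion v2.1 base (a "scaled_linear" beta schedule from 0.00085 to 0.012 over 1000 training timesteps); these values are not stated in this model card, and `snr_per_timestep` is a hypothetical helper name. It uses $\mathrm{SNR}_t = \bar\alpha_t / (1 - \bar\alpha_t)$, where $\bar\alpha_t$ is the cumulative product of $1 - \beta_t$.

```python
import math

# Assumed scheduler settings (not stated in this model card).
NUM_TRAIN_TIMESTEPS = 1000
BETA_START = 0.00085
BETA_END = 0.012


def snr_per_timestep(num_steps=NUM_TRAIN_TIMESTEPS, beta_start=BETA_START, beta_end=BETA_END):
    """Return SNR_t = alpha_bar_t / (1 - alpha_bar_t) for a scaled-linear beta schedule."""
    sqrt_start, sqrt_end = math.sqrt(beta_start), math.sqrt(beta_end)
    snrs = []
    alpha_bar = 1.0
    for t in range(num_steps):
        # "scaled_linear": interpolate linearly in sqrt(beta), then square.
        sqrt_beta = sqrt_start + (sqrt_end - sqrt_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - sqrt_beta**2
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs


snrs = snr_per_timestep()
print(f"SNR range: [{min(snrs):.4f}, {max(snrs):.4f}]")
```

Under these assumptions the minimum rounds to 0.0047 and the maximum lands within a few hundredths of 1175.44. Differences in the last digits can arise from floating-point precision, since this sketch uses double-precision Python floats.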
