sakinlesh committed on
Commit dd06d6b · verified · 1 Parent(s): 09edcbe

Upload 25 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ example/bed.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+ 
+ Copyright (c) 2023 Pakkapon Phongthawee
+ 
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+ 
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+ 
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,14 +1,101 @@
- ---
- title: Deneme
- emoji: 👁
- colorFrom: blue
- colorTo: green
- sdk: gradio
- sdk_version: 5.13.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- short_description: deneme deneme
- ---
- 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # DiffusionLight: Light Probes for Free by Painting a Chrome Ball
+ 
+ ### [Project Page](https://diffusionlight.github.io/) | [Paper](https://arxiv.org/abs/2312.09168) | [Colab](https://colab.research.google.com/drive/15pC4qb9mEtRYsW3utXkk-jnaeVxUy-0S?usp=sharing&sandboxMode=true) | [HuggingFace](https://huggingface.co/DiffusionLight/DiffusionLight)
+ 
+ [![Open DiffusionLight in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/15pC4qb9mEtRYsW3utXkk-jnaeVxUy-0S?usp=sharing&sandboxMode=true)
+ 
+ ![](https://diffusionlight.github.io/assets/images/thumbnail.jpg)
+ 
+ We present a simple yet effective technique to estimate lighting in a single input image. Current techniques rely heavily on HDR panorama datasets to train neural networks to regress an input with limited field-of-view to a full environment map. However, these approaches often struggle with real-world, uncontrolled settings due to the limited diversity and size of their datasets. To address this problem, we leverage diffusion models trained on billions of standard images to render a chrome ball into the input image. Despite its simplicity, this task remains challenging: the diffusion models often insert incorrect or inconsistent objects and cannot readily generate images in HDR format. Our research uncovers a surprising relationship between the appearance of chrome balls and the initial diffusion noise map, which we utilize to consistently generate high-quality chrome balls. We further fine-tune an LDR diffusion model (Stable Diffusion XL) with LoRA, enabling it to perform exposure bracketing for HDR light estimation. Our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios.
+ 
+ ## Table of contents
+ -----
+ * [TL;DR](#getting-started)
+ * [Installation](#installation)
+ * [Prediction](#prediction)
+ * [Evaluation](#evaluation)
+ * [Citation](#citation)
+ ------
+ 
+ ## Getting started
+ 
+ ```shell
+ conda env create -f environment.yml
+ conda activate diffusionlight
+ pip install -r requirements.txt
+ python inpaint.py --dataset example --output_dir output
+ python ball2envmap.py --ball_dir output/square --envmap_dir output/envmap
+ python exposure2hdr.py --input_dir output/envmap --output_dir output/hdr
+ ```
+ 
+ ## Installation
+ 
+ To set up the Python environment, run the following Conda and pip commands:
+ 
+ ```shell
+ conda env create -f environment.yml
+ conda activate diffusionlight
+ pip install -r requirements.txt
+ ```
+ 
+ Note that Conda is optional. However, if you choose not to use Conda, you must manually install the CUDA toolkit and OpenEXR.
+ 
+ ## Prediction
+ 
+ ### 0. Preparing the image
+ 
+ Please resize the input image to 1024x1024. If the image is not square, we recommend padding it with a black border.
+ 
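As a rough illustration (not part of the committed README), padding an arbitrary photo to a 1024x1024 square with a black border could look like the sketch below; the function name and file paths are made up for the example, and `pil_square_image` in `relighting/image_processor.py` from this commit does essentially the same thing at load time.

```python
from PIL import Image

def pad_to_square(src_path, dst_path, size=1024):
    """Scale the longer side to `size`, then pad the remainder with black."""
    image = Image.open(src_path).convert("RGB")
    scale = size / max(image.size)
    resized = image.resize(
        (round(image.width * scale), round(image.height * scale)), Image.LANCZOS
    )
    canvas = Image.new("RGB", (size, size), (0, 0, 0))  # black border
    canvas.paste(
        resized,
        ((size - resized.width) // 2, (size - resized.height) // 2),
    )
    canvas.save(dst_path)

pad_to_square("photo.jpg", "example/photo.png")  # hypothetical input/output paths
```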
+ ### 1. Inpainting the chrome ball
+ 
+ First, we predict the chrome ball at different exposure values (EV) using the following command:
+ 
+ ```shell
+ python inpaint.py --dataset <input_directory> --output_dir <output_directory>
+ ```
+ 
+ This command outputs three subdirectories: `control`, `raw`, and `square`.
+ 
+ The contents of each directory are:
+ 
+ - `control`: the conditioned depth map
+ - `raw`: the inpainted image with a chrome ball in the center
+ - `square`: the square-cropped chrome ball (used in the next step)
+ 
+ 
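For reference, the images in these subdirectories are suffixed with the EV value with its decimal point removed (e.g. `bed_ev-00.png`, `bed_ev-25.png`, `bed_ev-50.png` for the default `--ev "0,-2.5,-5"`), and `exposure2hdr.py` later recovers the EV by dividing by ten. A minimal sketch of that convention (the file names here are just examples):

```python
# Sketch of the EV-suffix convention shared by inpaint.py and exposure2hdr.py.
def ev_from_filename(filename, ev_string="_ev", endwith=".png"):
    suffix = filename.split(ev_string)[-1].replace(endwith, "")
    return int(suffix) / 10  # "_ev-25" -> -2.5

for name in ["bed_ev-00.png", "bed_ev-25.png", "bed_ev-50.png"]:
    print(name, "->", ev_from_filename(name))  # 0.0, -2.5, -5.0
```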
+ ### 2. Projecting a ball into an environment map
+ 
+ Next, we project the chrome ball from the previous step onto an LDR environment map using the following command:
+ 
+ ```shell
+ python ball2envmap.py --ball_dir <output_directory>/square --envmap_dir <output_directory>/envmap
+ ```
+ 
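Conceptually (a sketch mirroring the math inside `ball2envmap.py`, not a replacement for it), every environment-map direction is treated as a reflected ray; the matching point on the mirrored ball is found by taking the normalized half vector between the viewing direction and that ray as the surface normal, whose last two components become the lookup position in the square ball crop:

```python
import numpy as np

def ball_lookup_position(azimuth, polar):
    """Map an envmap direction to a [0, 1]^2 position on the chrome-ball crop."""
    # direction of the reflected ray (Blender-style axes, as in ball2envmap.py)
    reflect = np.array([
        np.sin(polar) * np.cos(azimuth),
        np.sin(polar) * np.sin(azimuth),
        np.cos(polar),
    ])
    incoming = np.array([1.0, 0.0, 0.0])      # viewer sits along +x
    normal = incoming + reflect
    normal = normal / np.linalg.norm(normal)  # half vector = mirror-ball normal
    pos = 1.0 - (normal + 1.0) / 2.0          # map [-1, 1] -> [0, 1] and flip
    return pos[1:]                            # 2-D lookup position inside the ball crop

print(ball_lookup_position(azimuth=0.0, polar=np.pi / 2))  # -> [0.5 0.5], the ball center
```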
+ ### 3. Composing the HDR image
+ 
+ Finally, we compose an HDR image from the multiple LDR environment maps using our custom exposure bracketing:
+ 
+ ```shell
+ python exposure2hdr.py --input_dir <output_directory>/envmap --output_dir <output_directory>/hdr
+ ```
+ 
+ The predicted light estimate is located at `<output_directory>/hdr` and can be used for downstream tasks such as object insertion. We also use it to compare against other methods.
+ 
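As an illustrative follow-up (a sketch, not part of this commit), the resulting `.exr` can be read back and re-exposed for a quick preview. `ezexr` is the EXR reader already used by the scripts in this commit; the path below assumes the bundled `example/bed.png` was processed with the default settings.

```python
import numpy as np
import skimage
import skimage.io
import ezexr  # EXR reader already used by ball2envmap.py / exposure2hdr.py

hdr = ezexr.imread("output/hdr/bed.exr")   # assumed output location
for ev in (0.0, -2.5, -5.0):               # re-expose like the EV brackets
    ldr = np.clip((hdr * (2.0 ** ev)) ** (1 / 2.4), 0.0, 1.0)  # gamma 2.4 as in exposure2hdr.py
    skimage.io.imsave(f"preview_ev{ev}.png", skimage.img_as_ubyte(ldr))
```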
+ ## Evaluation
+ We use the evaluation code from [StyleLight](https://style-light.github.io/) and [Editable Indoor LightEstimation](https://lvsn.github.io/EditableIndoorLight/). You can use their code to measure our score.
+ 
+ Additionally, we provide a *slightly* modified version of the evaluation code at [DiffusionLight-evaluation](https://github.com/DiffusionLight/DiffusionLight-evaluation), including the test input.
+ 
+ ## Citation
+ 
+ ```
+ @inproceedings{Phongthawee2023DiffusionLight,
+     author = {Phongthawee, Pakkapon and Chinchuthakun, Worameth and Sinsunthithet, Nontaphat and Raj, Amit and Jampani, Varun and Khungurn, Pramook and Suwajanakorn, Supasorn},
+     title = {DiffusionLight: Light Probes for Free by Painting a Chrome Ball},
+     booktitle = {ArXiv},
+     year = {2023},
+ }
+ ```
+ 
+ ## Visit us 🦉
+ [![Vision & Learning Laboratory](https://i.imgur.com/hQhkKhG.png)](https://vistec.ist/vision) [![VISTEC - Vidyasirimedhi Institute of Science and Technology](https://i.imgur.com/4wh8HQd.png)](https://vistec.ist/)
ball2envmap.py ADDED
@@ -0,0 +1,152 @@
+ # convert the ball to an environment map in lat-long format
+ 
+ import numpy as np
+ from PIL import Image
+ import skimage
+ import time
+ import torch
+ import argparse
+ from multiprocessing import Pool
+ from functools import partial
+ from tqdm.auto import tqdm
+ import os
+ 
+ try:
+     import ezexr
+ except:
+     pass
+ 
+ def create_argparser():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--ball_dir", type=str, required=True, help='directory that contains the ball images')
+     parser.add_argument("--envmap_dir", type=str, required=True, help='directory to output the environment maps')
+     parser.add_argument("--envmap_height", type=int, default=256, help="height of the environment map in pixels")
+     parser.add_argument("--scale", type=int, default=4, help="scale factor")
+     parser.add_argument("--threads", type=int, default=8, help="number of threads for parallel processing")
+     return parser
+ 
+ def create_envmap_grid(size: int):
+     """
+     BLENDER CONVENTION
+     Create the environment map grid that contains the position in spherical coordinates.
+     Top left is (0, 0) and bottom right is (2*pi, pi).
+     """
+ 
+     theta = torch.linspace(0, np.pi * 2, size * 2)
+     phi = torch.linspace(0, np.pi, size)
+ 
+     # use indexing 'xy' to match the vision convention
+     theta, phi = torch.meshgrid(theta, phi, indexing='xy')
+ 
+ 
+     theta_phi = torch.cat([theta[..., None], phi[..., None]], dim=-1)
+     theta_phi = theta_phi.numpy()
+     return theta_phi
+ 
+ def get_normal_vector(incoming_vector: np.ndarray, reflect_vector: np.ndarray):
+     """
+     BLENDER CONVENTION
+     incoming_vector: the vector from the point to the camera
+     reflect_vector: the vector from the point to the light source
+     """
+     # the normal is the normalized half vector between the incoming and reflected directions
+     N = (incoming_vector + reflect_vector) / np.linalg.norm(incoming_vector + reflect_vector, axis=-1, keepdims=True)
+     return N
+ 
+ def get_cartesian_from_spherical(theta: np.array, phi: np.array, r = 1.0):
+     """
+     BLENDER CONVENTION
+     theta: vertical angle
+     phi: horizontal angle
+     r: radius
+     """
+     x = r * np.sin(theta) * np.cos(phi)
+     y = r * np.sin(theta) * np.sin(phi)
+     z = r * np.cos(theta)
+     return np.concatenate([x[..., None], y[..., None], z[..., None]], axis=-1)
+ 
+ 
+ def process_image(args: argparse.Namespace, file_name: str):
+     I = np.array([1, 0, 0])
+ 
+     # skip if the output already exists
+     envmap_output_path = os.path.join(args.envmap_dir, file_name)
+     if os.path.exists(envmap_output_path):
+         return None
+ 
+     # read the ball image
+     ball_path = os.path.join(args.ball_dir, file_name)
+     if file_name.endswith(".exr"):
+         ball_image = ezexr.imread(ball_path)
+     else:
+         try:
+             ball_image = skimage.io.imread(ball_path)
+             ball_image = skimage.img_as_float(ball_image)
+         except:
+             return None
+ 
+     # compute the normal map created from the reflect vector
+     env_grid = create_envmap_grid(args.envmap_height * args.scale)
+     reflect_vec = get_cartesian_from_spherical(env_grid[..., 1], env_grid[..., 0])
+     normal = get_normal_vector(I[None, None], reflect_vec)
+ 
+     # turn the normal map into a lookup position [range: 0, 1]
+     pos = (normal + 1.0) / 2
+     pos = 1.0 - pos
+     pos = pos[..., 1:]
+ 
+     env_map = None
+ 
+     # use pytorch for bilinear interpolation
+     with torch.no_grad():
+         # convert the position to a pytorch grid lookup
+         grid = torch.from_numpy(pos)[None].float()
+         grid = grid * 2 - 1  # convert to range [-1, 1]
+ 
+         # convert the ball image to a pytorch tensor
+         ball_image = torch.from_numpy(ball_image[None]).float()
+         ball_image = ball_image.permute(0, 3, 1, 2)  # [1,3,H,W]
+ 
+         env_map = torch.nn.functional.grid_sample(ball_image, grid, mode='bilinear', padding_mode='border', align_corners=True)
+         env_map = env_map[0].permute(1, 2, 0).numpy()
+ 
+     env_map_default = skimage.transform.resize(env_map, (args.envmap_height, args.envmap_height * 2), anti_aliasing=True)
+     if file_name.endswith(".exr"):
+         ezexr.imwrite(envmap_output_path, env_map_default.astype(np.float32))
+     else:
+         env_map_default = skimage.img_as_ubyte(env_map_default)
+         skimage.io.imsave(envmap_output_path, env_map_default)
+     return None
+ 
+ 
+ 
+ 
+ def main():
+ 
+     # measure running time
+     start_time = time.time()
+ 
+     # load arguments
+     args = create_argparser().parse_args()
+ 
+     # make the output directory if it does not exist
+     os.makedirs(args.envmap_dir, exist_ok=True)
+ 
+     # get all files in the directory
+     files = sorted(os.listdir(args.ball_dir))
+ 
+     # create a partial function for parallel processing
+     process_func = partial(process_image, args)
+ 
+     # parallel processing
+     with Pool(args.threads) as p:
+         list(tqdm(p.imap(process_func, files), total=len(files)))
+ 
+     # print the total time
+     print("TOTAL TIME: ", time.time() - start_time)
+ 
+ 
+ 
+ if __name__ == "__main__":
+     main()
+ 
environment.yml ADDED
@@ -0,0 +1,8 @@
+ name: diffusionlight
+ channels:
+   - conda-forge
+   - defaults
+ dependencies:
+   - python=3.11.6
+   - cudatoolkit=11.8
+   - openexr==3.2.1
example/bed.png ADDED

Git LFS Details

  • SHA256: 5d832f4c8a4954d7d05611c3b5ed39f86517953dd4d9bfed1753ad2402bbb090
  • Pointer size: 132 Bytes
  • Size of remote file: 1.79 MB
exposure2hdr.py ADDED
@@ -0,0 +1,139 @@
+ # convert an exposure bracket to HDR output
+ import argparse
+ import os
+ from functools import partial
+ from multiprocessing import Pool
+ from tqdm import tqdm
+ import numpy as np
+ import skimage
+ import ezexr
+ from relighting.tonemapper import TonemapHDR
+ 
+ def create_argparser():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--input_dir", type=str, required=True, help='directory that contains the input LDR environment maps')
+     parser.add_argument("--output_dir", type=str, required=True, help='directory to output the HDR environment maps')
+     parser.add_argument("--endwith", type=str, default=".png", help='file ending used to filter out unwanted images')
+     parser.add_argument("--ev_string", type=str, default="_ev", help='string used to locate the EV value in the filename')
+     parser.add_argument("--EV", type=str, default="0, -2.5, -5", help='available EV values')
+     parser.add_argument("--gamma", default=2.4, help="Gamma value", type=float)
+     parser.add_argument('--preview_output', dest='preview_output', action='store_true')
+     parser.set_defaults(preview_output=False)
+     return parser
+ 
+ def parse_filename(ev_string, endwith, filename):
+     a = filename.split(ev_string)
+     name = ev_string.join(a[:-1])
+     ev = a[-1].replace(endwith, "")
+     ev = int(ev) / 10
+     return {
+         'name': name,
+         'ev': ev,
+         'filename': filename
+     }
+ 
+ def process_image(args, info):
+ 
+     # output directory
+     hdrdir = args.output_dir
+     os.makedirs(hdrdir, exist_ok=True)
+ 
+     scaler = np.array([0.212671, 0.715160, 0.072169])
+     name = info['name']
+     # EV values for the files, sorted from brightest to darkest
+     evs = [e for e in sorted(info['ev'], reverse = True)]
+ 
+     # filenames
+     files = [info['ev'][e] for e in evs]
+ 
+     # initialize from the first (EV 0) image
+     image0 = skimage.io.imread(os.path.join(args.input_dir, files[0]))[...,:3]
+     image0 = skimage.img_as_float(image0)
+     image0_linear = np.power(image0, args.gamma)
+ 
+     # read the luminance of every image
+     luminances = []
+     for i in range(len(evs)):
+         # load image
+         path = os.path.join(args.input_dir, files[i])
+         image = skimage.io.imread(path)[...,:3]
+         image = skimage.img_as_float(image)
+ 
+         # apply gamma correction
+         linear_img = np.power(image, args.gamma)
+ 
+         # scale the brightness by the exposure value
+         linear_img *= 1 / (2 ** evs[i])
+ 
+         # compute the luminance
+         lumi = linear_img @ scaler
+         luminances.append(lumi)
+ 
+     # start from the darkest image
+     out_luminace = luminances[len(evs) - 1]
+     for i in range(len(evs) - 1, 0, -1):
+         # compute mask
+         maxval = 1 / (2 ** evs[i-1])
+         p1 = np.clip((luminances[i-1] - 0.9 * maxval) / (0.1 * maxval), 0, 1)
+         p2 = out_luminace > luminances[i-1]
+         mask = (p1 * p2).astype(np.float32)
+         out_luminace = luminances[i-1] * (1-mask) + out_luminace * mask
+ 
+     hdr_rgb = image0_linear * (out_luminace / (luminances[0] + 1e-10))[:, :, np.newaxis]
+ 
+     # tone map for visualization
+     hdr2ldr = TonemapHDR(gamma=args.gamma, percentile=99, max_mapping=0.9)
+ 
+ 
+     ldr_rgb, _, _ = hdr2ldr(hdr_rgb)
+ 
+     ezexr.imwrite(os.path.join(hdrdir, name+".exr"), hdr_rgb.astype(np.float32))
+     if args.preview_output:
+         preview_dir = os.path.join(args.output_dir, "preview")
+         os.makedirs(preview_dir, exist_ok=True)
+         bracket = []
+         for s in 2 ** np.linspace(0, evs[-1], 10): # evs[-1] is -5
+             lumi = np.clip((s * hdr_rgb) ** (1/args.gamma), 0, 1)
+             bracket.append(lumi)
+         bracket = np.concatenate(bracket, axis=1)
+         skimage.io.imsave(os.path.join(preview_dir, name+".png"), skimage.img_as_ubyte(bracket))
+     return None
+ 
+ def main():
+     # load arguments
+     args = create_argparser().parse_args()
+ 
+     files = sorted(os.listdir(args.input_dir))
+ 
+     # filter out files by ending
+     files = [f for f in files if f.endswith(args.endwith)]
+     evs = [float(e.strip()) for e in args.EV.split(",")]
+ 
+     # parse into useful data
+     files = [parse_filename(args.ev_string, args.endwith, f) for f in files]
+ 
+     # filter out unused EVs
+     files = [f for f in files if f['ev'] in evs]
+ 
+     info = {}
+     for f in files:
+         if not f['name'] in info:
+             info[f['name']] = {}
+         info[f['name']][f['ev']] = f['filename']
+ 
+     infolist = []
+     for k in info:
+         if len(info[k]) != len(evs):
+             print("WARNING: missing ev in ", k)
+             continue
+         # convert to list data
+         infolist.append({'name': k, 'ev': info[k]})
+ 
+     fn = partial(process_image, args)
+     with Pool(8) as p:
+         r = list(tqdm(p.imap(fn, infolist), total=len(infolist)))
+ 
+ 
+ 
+ if __name__ == "__main__":
+     main()
inpaint.py ADDED
@@ -0,0 +1,363 @@
1
+ # inpaint the ball on an image
2
+ # this one is designed for general images that do not require a special location for the ball
3
+
4
+
5
+ import torch
6
+ import argparse
7
+ import numpy as np
8
+ import torch.distributed as dist
9
+ import os
10
+ from PIL import Image
11
+ from tqdm.auto import tqdm
12
+ import json
13
+
14
+
15
+ from relighting.inpainter import BallInpainter
16
+
17
+ from relighting.mask_utils import MaskGenerator
18
+ from relighting.ball_processor import (
19
+ get_ideal_normal_ball,
20
+ crop_ball
21
+ )
22
+ from relighting.dataset import GeneralLoader
23
+ from relighting.utils import name2hash
24
+ import relighting.dist_utils as dist_util
25
+ import time
26
+
27
+
28
+ # cross import from inpaint_multi-illum.py
29
+ from relighting.argument import (
30
+ SD_MODELS,
31
+ CONTROLNET_MODELS,
32
+ VAE_MODELS
33
+ )
34
+
35
+ def create_argparser():
36
+ parser = argparse.ArgumentParser()
37
+ parser.add_argument("--dataset", type=str, required=True, help='directory that contains the input images') # dataset name or directory
38
+ parser.add_argument("--ball_size", type=int, default=256, help="size of the ball in pixel")
39
+ parser.add_argument("--ball_dilate", type=int, default=20, help="How much pixel to dilate the ball to make a sharper edge")
40
+ parser.add_argument("--prompt", type=str, default="a perfect mirrored reflective chrome ball sphere")
41
+ parser.add_argument("--prompt_dark", type=str, default="a perfect black dark mirrored reflective chrome ball sphere")
42
+ parser.add_argument("--negative_prompt", type=str, default="matte, diffuse, flat, dull")
43
+ parser.add_argument("--model_option", default="sdxl", help='selecting fancy model option (sd15_old, sd15_new, sd21, sdxl, sdxl_turbo)') # [sd15_old, sd15_new, or sd21]
44
+ parser.add_argument("--output_dir", required=True, type=str, help="output directory")
45
+ parser.add_argument("--img_height", type=int, default=1024, help="Dataset Image Height")
46
+ parser.add_argument("--img_width", type=int, default=1024, help="Dataset Image Width")
47
+ # some good seed 0, 37, 71, 125, 140, 196, 307, 434, 485, 575 | 9021, 9166, 9560, 9814, but default auto is for fairness
48
+ parser.add_argument("--seed", default="auto", type=str, help="Seed: right now we use single seed instead to reduce the time, (Auto will use hash file name to generate seed)")
49
+ parser.add_argument("--denoising_step", default=30, type=int, help="number of denoising step of diffusion model")
50
+ parser.add_argument("--control_scale", default=0.5, type=float, help="controlnet conditioning scale")
51
+ parser.add_argument("--guidance_scale", default=5.0, type=float, help="guidance scale (also known as CFG)")
52
+
53
+ parser.add_argument('--no_controlnet', dest='use_controlnet', action='store_false', help='ControlNet is used by default; pass this option to disable it and see the difference')
54
+ parser.set_defaults(use_controlnet=True)
55
+
56
+ parser.add_argument('--no_force_square', dest='force_square', action='store_false', help='SDXL is trained on square images, so we prefer square input; pass this option to disable the reshape')
57
+ parser.set_defaults(force_square=True)
58
+
59
+ parser.add_argument('--no_random_loader', dest='random_loader', action='store_false', help="the dataset order is shuffled by default so we can peek at the trend of the results without waiting for the entire dataset; pass this option to disable it")
60
+ parser.set_defaults(random_loader=True)
61
+
62
+ parser.add_argument('--cpu', dest='is_cpu', action='store_true', help="using CPU inference instead of GPU inference")
63
+ parser.set_defaults(is_cpu=False)
64
+
65
+ parser.add_argument('--offload', dest='offload', action='store_true', help="enable diffusers CPU offload")
66
+ parser.set_defaults(offload=False)
67
+
68
+ parser.add_argument("--limit_input", default=0, type=int, help="limit number of image to process to n image (0 = no limit), useful for run smallset")
69
+
70
+
71
+ # LoRA stuff
72
+ parser.add_argument('--no_lora', dest='use_lora', action='store_false', help='by default we using lora, we have option to disable to see the different')
73
+ parser.set_defaults(use_lora=True)
74
+
75
+ parser.add_argument("--lora_path", default="models/ThisIsTheFinal-lora-hdr-continuous-largeT@900/0_-5/checkpoint-2500", type=str, help="LoRA Checkpoint path")
76
+ parser.add_argument("--lora_scale", default=0.75, type=float, help="LoRA scale factor")
77
+
78
+ # speed optimization stuff
79
+ parser.add_argument('--no_torch_compile', dest='use_torch_compile', action='store_false', help='torch.compile is used by default for faster processing; disable it if your environment runs PyTorch older than 2.0')
80
+ parser.set_defaults(use_torch_compile=True)
81
+
82
+ # algorithm + iterative stuff
83
+ parser.add_argument("--algorithm", type=str, default="iterative", choices=["iterative", "normal"], help="Selecting between iterative or normal (single pass inpaint) algorithm")
84
+
85
+ parser.add_argument("--agg_mode", default="median", type=str)
86
+ parser.add_argument("--strength", default=0.8, type=float)
87
+ parser.add_argument("--num_iteration", default=2, type=int)
88
+ parser.add_argument("--ball_per_iteration", default=30, type=int)
89
+ parser.add_argument('--no_save_intermediate', dest='save_intermediate', action='store_false')
90
+ parser.set_defaults(save_intermediate=True)
91
+ parser.add_argument("--cache_dir", default="./temp_inpaint_iterative", type=str, help="cache directory for iterative inpaint")
92
+
93
+ # pararelle processing
94
+ parser.add_argument("--idx", default=0, type=int, help="index of the current process, useful for running on multiple node")
95
+ parser.add_argument("--total", default=1, type=int, help="total number of process")
96
+
97
+ # for HDR stuff
98
+ parser.add_argument("--max_negative_ev", default=-5, type=int, help="maximum negative EV for lora")
99
+ parser.add_argument("--ev", default="0,-2.5,-5", type=str, help="EV: list of EV to generate")
100
+
101
+ return parser
102
+
103
+ def get_ball_location(image_data, args):
104
+ if 'boundary' in image_data:
105
+ # support predefined boundary if need
106
+ x = image_data["boundary"]["x"]
107
+ y = image_data["boundary"]["y"]
108
+ r = image_data["boundary"]["size"]
109
+
110
+ # support ball dilation
111
+ half_dilate = args.ball_dilate // 2
112
+
113
+ # check if not left out-of-bound
114
+ if x - half_dilate < 0: x += half_dilate
115
+ if y - half_dilate < 0: y += half_dilate
116
+
117
+ # check if not right out-of-bound
118
+ if x + r + half_dilate > args.img_width: x -= half_dilate
119
+ if y + r + half_dilate > args.img_height: y -= half_dilate
120
+
121
+ else:
122
+ # we use top-left corner notation
123
+ x, y, r = ((args.img_width // 2) - (args.ball_size // 2), (args.img_height // 2) - (args.ball_size // 2), args.ball_size)
124
+ return x, y, r
125
+
126
+ def interpolate_embedding(pipe, args):
127
+ print("interpolate embedding...")
128
+
129
+ # get list of all EVs
130
+ ev_list = [float(x) for x in args.ev.split(",")]
131
+ interpolants = [ev / args.max_negative_ev for ev in ev_list]
132
+
133
+ print("EV : ", ev_list)
134
+ print("EV : ", interpolants)
135
+
136
+ # calculate prompt embeddings
137
+ prompt_normal = args.prompt
138
+ prompt_dark = args.prompt_dark
139
+ prompt_embeds_normal, _, pooled_prompt_embeds_normal, _ = pipe.pipeline.encode_prompt(prompt_normal)
140
+ prompt_embeds_dark, _, pooled_prompt_embeds_dark, _ = pipe.pipeline.encode_prompt(prompt_dark)
141
+
142
+ # interpolate embeddings
143
+ interpolate_embeds = []
144
+ for t in interpolants:
145
+ int_prompt_embeds = prompt_embeds_normal + t * (prompt_embeds_dark - prompt_embeds_normal)
146
+ int_pooled_prompt_embeds = pooled_prompt_embeds_normal + t * (pooled_prompt_embeds_dark - pooled_prompt_embeds_normal)
147
+
148
+ interpolate_embeds.append((int_prompt_embeds, int_pooled_prompt_embeds))
149
+
150
+ return dict(zip(ev_list, interpolate_embeds))
151
+
152
+ def main():
153
+ # load arguments
154
+ args = create_argparser().parse_args()
155
+
156
+ # get local rank
157
+ if args.is_cpu:
158
+ device = torch.device("cpu")
159
+ torch_dtype = torch.float32
160
+ else:
161
+ device = dist_util.dev()
162
+ torch_dtype = torch.float16
163
+
164
+ # so, we need ball_dilate >= 16 (2*vae_scale_factor) to make our mask shape = (272, 272)
165
+ assert args.ball_dilate % 2 == 0 # ball dilation should be symmetric
166
+
167
+ # create controlnet pipeline
168
+ if args.model_option in ["sdxl", "sdxl_fast", "sdxl_turbo"] and args.use_controlnet:
169
+ model, controlnet = SD_MODELS[args.model_option], CONTROLNET_MODELS[args.model_option]
170
+ pipe = BallInpainter.from_sdxl(
171
+ model=model,
172
+ controlnet=controlnet,
173
+ device=device,
174
+ torch_dtype = torch_dtype,
175
+ offload = args.offload
176
+ )
177
+ elif args.model_option in ["sdxl", "sdxl_fast", "sdxl_turbo"] and not args.use_controlnet:
178
+ model = SD_MODELS[args.model_option]
179
+ pipe = BallInpainter.from_sdxl(
180
+ model=model,
181
+ controlnet=None,
182
+ device=device,
183
+ torch_dtype = torch_dtype,
184
+ offload = args.offload
185
+ )
186
+ elif args.use_controlnet:
187
+ model, controlnet = SD_MODELS[args.model_option], CONTROLNET_MODELS[args.model_option]
188
+ pipe = BallInpainter.from_sd(
189
+ model=model,
190
+ controlnet=controlnet,
191
+ device=device,
192
+ torch_dtype = torch_dtype,
193
+ offload = args.offload
194
+ )
195
+ else:
196
+ model = SD_MODELS[args.model_option]
197
+ pipe = BallInpainter.from_sd(
198
+ model=model,
199
+ controlnet=None,
200
+ device=device,
201
+ torch_dtype = torch_dtype,
202
+ offload = args.offload
203
+ )
204
+
205
+ if args.model_option in ["sdxl_turbo"]:
206
+ # Guidance scale is not supported in sdxl_turbo
207
+ args.guidance_scale = 0.0
208
+
209
+ if args.lora_scale > 0 and args.lora_path is None:
210
+ raise ValueError("lora scale is not 0 but lora path is not set")
211
+
212
+ if (args.lora_path is not None) and (args.use_lora):
213
+ print(f"using lora path {args.lora_path}")
214
+ print(f"using lora scale {args.lora_scale}")
215
+ pipe.pipeline.load_lora_weights(args.lora_path)
216
+ pipe.pipeline.fuse_lora(lora_scale=args.lora_scale) # fuse lora weight w' = w + \alpha \Delta w
217
+ enabled_lora = True
218
+ else:
219
+ enabled_lora = False
220
+
221
+ if args.use_torch_compile:
222
+ try:
223
+ print("compiling unet model")
224
+ start_time = time.time()
225
+ pipe.pipeline.unet = torch.compile(pipe.pipeline.unet, mode="reduce-overhead", fullgraph=True)
226
+ print("Model compilation time: ", time.time() - start_time)
227
+ except:
228
+ pass
229
+
230
+ # default height for sdxl is 1024, if not set, we set default height.
231
+ if args.model_option == "sdxl" and args.img_height == 0 and args.img_width == 0:
232
+ args.img_height = 1024
233
+ args.img_width = 1024
234
+
235
+ # load dataset
236
+ dataset = GeneralLoader(
237
+ root=args.dataset,
238
+ resolution=(args.img_width, args.img_height),
239
+ force_square=args.force_square,
240
+ return_dict=True,
241
+ random_shuffle=args.random_loader,
242
+ process_id=args.idx,
243
+ process_total=args.total,
244
+ limit_input=args.limit_input,
245
+ )
246
+
247
+ # interpolate embedding
248
+ embedding_dict = interpolate_embedding(pipe, args)
249
+
250
+ # prepare mask and normal ball
251
+ mask_generator = MaskGenerator()
252
+ normal_ball, mask_ball = get_ideal_normal_ball(size=args.ball_size+args.ball_dilate)
253
+ _, mask_ball_for_crop = get_ideal_normal_ball(size=args.ball_size)
254
+
255
+
256
+ # make output directory if not exist
257
+ raw_output_dir = os.path.join(args.output_dir, "raw")
258
+ control_output_dir = os.path.join(args.output_dir, "control")
259
+ square_output_dir = os.path.join(args.output_dir, "square")
260
+ os.makedirs(args.output_dir, exist_ok=True)
261
+ os.makedirs(raw_output_dir, exist_ok=True)
262
+ os.makedirs(control_output_dir, exist_ok=True)
263
+ os.makedirs(square_output_dir, exist_ok=True)
264
+
265
+ # create split seed
266
+ # please DO NOT manually replace this line; use the --seed option instead
267
+ seeds = args.seed.split(",")
268
+
269
+ for image_data in tqdm(dataset):
270
+ input_image = image_data["image"]
271
+ image_path = image_data["path"]
272
+
273
+ for ev, (prompt_embeds, pooled_prompt_embeds) in embedding_dict.items():
274
+ # create output file name (we always use png to prevent quality loss)
275
+ ev_str = str(ev).replace(".", "") if ev != 0 else "-00"
276
+ outname = os.path.basename(image_path).split(".")[0] + f"_ev{ev_str}"
277
+
278
+ # we use top-left corner notation (which is different from aj.aek's center point notation)
279
+ x, y, r = get_ball_location(image_data, args)
280
+
281
+ # create inpaint mask
282
+ mask = mask_generator.generate_single(
283
+ input_image, mask_ball,
284
+ x - (args.ball_dilate // 2),
285
+ y - (args.ball_dilate // 2),
286
+ r + args.ball_dilate
287
+ )
288
+
289
+ seeds = tqdm(seeds, desc="seeds") if len(seeds) > 10 else seeds
290
+
291
+ # repeatedly create the image with different seeds
292
+ for seed in seeds:
293
+ start_time = time.time()
294
+ # set seed, if seed auto we use file name as seed
295
+ if seed == "auto":
296
+ filename = os.path.basename(image_path).split(".")[0]
297
+ seed = name2hash(filename)
298
+ outpng = f"{outname}.png"
299
+ cache_name = f"{outname}"
300
+ else:
301
+ seed = int(seed)
302
+ outpng = f"{outname}_seed{seed}.png"
303
+ cache_name = f"{outname}_seed{seed}"
304
+ # skip if file exist, useful for resuming
305
+ if os.path.exists(os.path.join(square_output_dir, outpng)):
306
+ continue
307
+ generator = torch.Generator().manual_seed(seed)
308
+ kwargs = {
309
+ "prompt_embeds": prompt_embeds,
310
+ "pooled_prompt_embeds": pooled_prompt_embeds,
311
+ 'negative_prompt': args.negative_prompt,
312
+ 'num_inference_steps': args.denoising_step,
313
+ 'generator': generator,
314
+ 'image': input_image,
315
+ 'mask_image': mask,
316
+ 'strength': 1.0,
317
+ 'current_seed': seed, # we still need seed in the pipeline!
318
+ 'controlnet_conditioning_scale': args.control_scale,
319
+ 'height': args.img_height,
320
+ 'width': args.img_width,
321
+ 'normal_ball': normal_ball,
322
+ 'mask_ball': mask_ball,
323
+ 'x': x,
324
+ 'y': y,
325
+ 'r': r,
326
+ 'guidance_scale': args.guidance_scale,
327
+ }
328
+
329
+ if enabled_lora:
330
+ kwargs["cross_attention_kwargs"] = {"scale": args.lora_scale}
331
+
332
+ if args.algorithm == "normal":
333
+ output_image = pipe.inpaint(**kwargs).images[0]
334
+ elif args.algorithm == "iterative":
335
+ # This is still buggy
336
+ print("using inpainting iterative, this is going to take a while...")
337
+ kwargs.update({
338
+ "strength": args.strength,
339
+ "num_iteration": args.num_iteration,
340
+ "ball_per_iteration": args.ball_per_iteration,
341
+ "agg_mode": args.agg_mode,
342
+ "save_intermediate": args.save_intermediate,
343
+ "cache_dir": os.path.join(args.cache_dir, cache_name),
344
+ })
345
+ output_image = pipe.inpaint_iterative(**kwargs)
346
+ else:
347
+ raise NotImplementedError(f"Unknown algorithm {args.algorithm}")
348
+
349
+
350
+ square_image = output_image.crop((x, y, x+r, y+r))
351
+
352
+ # return the most recent control_image for sanity check
353
+ control_image = pipe.get_cache_control_image()
354
+ if control_image is not None:
355
+ control_image.save(os.path.join(control_output_dir, outpng))
356
+
357
+ # save image
358
+ output_image.save(os.path.join(raw_output_dir, outpng))
359
+ square_image.save(os.path.join(square_output_dir, outpng))
360
+
361
+
362
+ if __name__ == "__main__":
363
+ main()
models/ThisIsTheFinal-lora-hdr-continuous-largeT@900/0_-5/checkpoint-2500/optimizer.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b96eb34accf4ce5f33dff729e12669c46e3132973ee2cf0ac2d4f2c993a2af4d
+ size 47392882
models/ThisIsTheFinal-lora-hdr-continuous-largeT@900/0_-5/checkpoint-2500/pytorch_lora_weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6bf601e30c08ce4f1eea6e09e1780bd8ba2588986eaee8379672350707dcddaa
+ size 23396024
models/ThisIsTheFinal-lora-hdr-continuous-largeT@900/0_-5/checkpoint-2500/random_states_0.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ec103b2e80b357d29e2ec8355d49f0c289331f0ece810d5bafd464d33e5f4c76
+ size 14280
models/ThisIsTheFinal-lora-hdr-continuous-largeT@900/0_-5/checkpoint-2500/scheduler.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9244a743d4761f975ab2a14d0b7a509a85554dd77bef6490330160a2a639fae
+ size 1000
relighting/argument.py ADDED
@@ -0,0 +1,43 @@
+ import argparse
+ from diffusers import DDIMScheduler, DDPMScheduler, UniPCMultistepScheduler
+ 
+ def get_control_signal_type(controlnet):
+     if "normal" in controlnet:
+         return "normal"
+     elif "depth" in controlnet:
+         return "depth"
+     else:
+         raise NotImplementedError
+ 
+ SD_MODELS = {
+     "sd15_old": "runwayml/stable-diffusion-inpainting",
+     "sd15_new": "runwayml/stable-diffusion-inpainting",
+     "sd21": "stabilityai/stable-diffusion-2-inpainting",
+     "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
+     "sdxl_fast": "stabilityai/stable-diffusion-xl-base-1.0",
+     "sdxl_turbo": "stabilityai/sdxl-turbo",
+     "sd15_depth": "runwayml/stable-diffusion-inpainting",
+ }
+ 
+ VAE_MODELS = {
+     "sdxl": "madebyollin/sdxl-vae-fp16-fix",
+     "sdxl_fast": "madebyollin/sdxl-vae-fp16-fix",
+ }
+ 
+ CONTROLNET_MODELS = {
+     "sd15_old": "fusing/stable-diffusion-v1-5-controlnet-normal",
+     "sd15_new": "lllyasviel/control_v11p_sd15_normalbae",
+     "sd21": "thibaud/controlnet-sd21-normalbae-diffusers",
+     "sdxl": "diffusers/controlnet-depth-sdxl-1.0",
+     "sdxl_fast": "diffusers/controlnet-depth-sdxl-1.0-small",
+     "sdxl_turbo": "diffusers/controlnet-depth-sdxl-1.0-small",
+     "sd15_depth": "lllyasviel/control_v11f1p_sd15_depth",
+ }
+ 
+ SAMPLERS = {
+     "ddim": DDIMScheduler,
+     "ddpm": DDPMScheduler,
+     "unipc": UniPCMultistepScheduler,
+ }
+ 
+ DEPTH_ESTIMATOR = "Intel/dpt-hybrid-midas"
relighting/ball_processor.py ADDED
@@ -0,0 +1,60 @@
+ import torch
+ import numpy as np
+ from PIL import Image
+ from scipy.special import sph_harm
+ 
+ def crop_ball(image, mask_ball, x, y, size, apply_mask=True, bg_color = (0, 0, 0)):
+     if isinstance(image, Image.Image):
+         result = np.array(image)
+     else:
+         result = image.copy()
+ 
+     result = result[y:y+size, x:x+size]
+     if apply_mask:
+         result[~mask_ball] = bg_color
+     return result
+ 
+ def get_ideal_normal_ball(size, flip_x=True):
+     """
+     Generate a normal ball for a specific size
+     Normal map is x "left", y up, z into the screen
+     (we flip x to match the Sobel operator)
+     @params
+     - size (int) - single value of height and width
+     @return:
+     - normal_map (np.array) - normal map [size, size, 3]
+     - mask (np.array) - mask that marks the valid normal map region [size, size]
+     """
+     # we flip x to match the Sobel operator
+     x = torch.linspace(1, -1, size)
+     y = torch.linspace(1, -1, size)
+     x = x.flip(dims=(-1,)) if not flip_x else x
+ 
+     y, x = torch.meshgrid(y, x)
+     z = (1 - x**2 - y**2)
+     mask = z >= 0
+ 
+     # clean up invalid values outside the mask
+     x = x * mask
+     y = y * mask
+     z = z * mask
+ 
+     # get the real z value
+     z = torch.sqrt(z)
+ 
+     # clean up normal map values outside the mask
+     normal_map = torch.cat([x[..., None], y[..., None], z[..., None]], dim=-1)
+     normal_map = normal_map.numpy()
+     mask = mask.numpy()
+     return normal_map, mask
+ 
+ def get_predicted_normal_ball(size, precomputed_path=None):
+     if precomputed_path is not None:
+         normal_map = Image.open(precomputed_path).resize((size, size))
+         normal_map = np.array(normal_map).astype(np.uint8)
+         _, mask = get_ideal_normal_ball(size)
+     else:
+         raise NotImplementedError
+ 
+     normal_map = (normal_map - 127.5) / 127.5 # normalize for compatibility with the inpainting pipeline
+     return normal_map, mask
relighting/dataset.py ADDED
@@ -0,0 +1,412 @@
1
+ import glob
2
+ import json
3
+ import os
4
+ import skimage
5
+ import numpy as np
6
+ from pathlib import Path
7
+ from natsort import natsorted
8
+ from PIL import Image
9
+ from relighting.image_processor import pil_square_image
10
+ from tqdm.auto import tqdm
11
+ import random
12
+ import itertools
13
+ from abc import ABC, abstractmethod
14
+
15
+ class Dataset(ABC):
16
+ def __init__(self,
17
+ resolution=(1024, 1024),
18
+ force_square=True,
19
+ return_image_path=False,
20
+ return_dict=False,
21
+ ):
22
+ """
23
+ Resolution is (WIDTH, HEIGHT)
24
+ """
25
+ self.resolution = resolution
26
+ self.force_square = force_square
27
+ self.return_image_path = return_image_path
28
+ self.return_dict = return_dict
29
+ self.scene_data = []
30
+ self.meta_data = []
31
+ self.boundary_info = []
32
+
33
+ @abstractmethod
34
+ def _load_data_path(self):
35
+ pass
36
+
37
+ def __len__(self):
38
+ return len(self.scene_data)
39
+
40
+ def __getitem__(self, idx):
41
+ image = Image.open(self.scene_data[idx])
42
+ if self.force_square:
43
+ image = pil_square_image(image, self.resolution)
44
+ else:
45
+ image = image.resize(self.resolution)
46
+
47
+ if self.return_dict:
48
+ d = {
49
+ "image": image,
50
+ "path": self.scene_data[idx]
51
+ }
52
+ if len(self.boundary_info) > 0:
53
+ d["boundary"] = self.boundary_info[idx]
54
+
55
+ return d
56
+ elif self.return_image_path:
57
+ return image, self.scene_data[idx]
58
+ else:
59
+ return image
60
+
61
+ class GeneralLoader(Dataset):
62
+ def __init__(self,
63
+ root=None,
64
+ num_samples=None,
65
+ res_threshold=((1024, 1024)),
66
+ apply_threshold=False,
67
+ random_shuffle=False,
68
+ process_id = 0,
69
+ process_total = 1,
70
+ limit_input = 0,
71
+ **kwargs,
72
+ ):
73
+ super().__init__(**kwargs)
74
+ self.root = root
75
+ self.res_threshold = res_threshold
76
+ self.apply_threshold = apply_threshold
77
+ self.has_meta = False
78
+
79
+ if self.root is not None:
80
+ if not os.path.exists(self.root):
81
+ raise Exception(f"Dataset {self.root} does not exist.")
82
+
83
+ paths = natsorted(
84
+ list(glob.glob(os.path.join(self.root, "*.png"))) + \
85
+ list(glob.glob(os.path.join(self.root, "*.jpg")))
86
+ )
87
+ self.scene_data = self._load_data_path(paths, num_samples=num_samples)
88
+
89
+ if random_shuffle:
90
+ SEED = 0
91
+ random.Random(SEED).shuffle(self.scene_data)
92
+ random.Random(SEED).shuffle(self.boundary_info)
93
+
94
+ if limit_input > 0:
95
+ self.scene_data = self.scene_data[:limit_input]
96
+ self.boundary_info = self.boundary_info[:limit_input]
97
+
98
+ # please keep this step last, so scene_data and boundary_info get filtered consistently
99
+ if process_total > 1:
100
+ self.scene_data = self.scene_data[process_id::process_total]
101
+ self.boundary_info = self.boundary_info[process_id::process_total]
102
+ print(f"Process {process_id} has {len(self.scene_data)} samples")
103
+
104
+ def _load_data_path(self, paths, num_samples=None):
105
+ if os.path.exists(os.path.splitext(paths[0])[0] + ".json") or os.path.exists(os.path.splitext(paths[-1])[0] + ".json"):
106
+ self.has_meta = True
107
+
108
+ if self.has_meta:
109
+ # read metadata
110
+ TARGET_KEY = "chrome_mask256"
111
+ for path in paths:
112
+ with open(os.path.splitext(path)[0] + ".json") as f:
113
+ meta = json.load(f)
114
+ self.meta_data.append(meta)
115
+ boundary = {
116
+ "x": meta[TARGET_KEY]["x"],
117
+ "y": meta[TARGET_KEY]["y"],
118
+ "size": meta[TARGET_KEY]["w"],
119
+ }
120
+ self.boundary_info.append(boundary)
121
+
122
+
123
+ scene_data = paths
124
+ if self.apply_threshold:
125
+ scene_data = []
126
+ for path in tqdm(paths):
127
+ img = Image.open(path)
128
+ if (img.size[0] >= self.res_threshold[0]) and (img.size[1] >= self.res_threshold[1]):
129
+ scene_data.append(path)
130
+
131
+ if num_samples is not None:
132
+ max_idx = min(num_samples, len(scene_data))
133
+ scene_data = scene_data[:max_idx]
134
+
135
+ return scene_data
136
+
137
+ @classmethod
138
+ def from_image_paths(cls, paths, *args, **kwargs):
139
+ dataset = cls(*args, **kwargs)
140
+ dataset.scene_data = dataset._load_data_path(paths)
141
+ return dataset
142
+
143
+ class ALPLoader(Dataset):
144
+ def __init__(self,
145
+ root=None,
146
+ num_samples=None,
147
+ res_threshold=((1024, 1024)),
148
+ apply_threshold=False,
149
+ **kwargs,
150
+ ):
151
+ super().__init__(**kwargs)
152
+ self.root = root
153
+ self.res_threshold = res_threshold
154
+ self.apply_threshold = apply_threshold
155
+ self.has_meta = False
156
+
157
+ if self.root is not None:
158
+ if not os.path.exists(self.root):
159
+ raise Exception(f"Dataset {self.root} does not exist.")
160
+
161
+ dirs = natsorted(list(glob.glob(os.path.join(self.root, "*"))))
162
+ self.scene_data = self._load_data_path(dirs)
163
+
164
+ def _load_data_path(self, dirs):
165
+ self.scene_names = [Path(dir).name for dir in dirs]
166
+
167
+ scene_data = []
168
+ for dir in dirs:
169
+ pseudo_probe_dirs = natsorted(list(glob.glob(os.path.join(dir, "*"))))
170
+ pseudo_probe_dirs = [dir for dir in pseudo_probe_dirs if "gt" not in dir]
171
+ data = [os.path.join(dir, "images", "0.png") for dir in pseudo_probe_dirs]
172
+ scene_data.append(data)
173
+
174
+ scene_data = list(itertools.chain(*scene_data))
175
+ return scene_data
176
+
177
+ class MultiIlluminationLoader(Dataset):
178
+ def __init__(self,
179
+ root,
180
+ mask_probe=True,
181
+ mask_boundingbox=False,
182
+ **kwargs,
183
+ ):
184
+ """
185
+ @params resolution (tuple): (width, height) - resolution of the image
186
+ @params force_square: will add black border to make the image square while keeping the aspect ratio
187
+ @params mask_probe: mask the probe with the mask in the dataset
188
+
189
+ """
190
+ super().__init__(**kwargs)
191
+ self.root = root
192
+ self.mask_probe = mask_probe
193
+ self.mask_boundingbox = mask_boundingbox
194
+
195
+ if self.root is not None:
196
+ dirs = natsorted(list(glob.glob(os.path.join(self.root, "*"))))
197
+ self.scene_data = self._load_data_path(dirs)
198
+
199
+ def _load_data_path(self, dirs):
200
+ self.scene_names = [Path(dir).name for dir in dirs]
201
+
202
+ data = {}
203
+ for dir in dirs:
204
+ chrome_probes = natsorted(list(glob.glob(os.path.join(dir, "probes", "*chrome*.jpg"))))
205
+ gray_probes = natsorted(list(glob.glob(os.path.join(dir, "probes", "*gray*.jpg"))))
206
+ scenes = natsorted(list(glob.glob(os.path.join(dir, "dir_*.jpg"))))
207
+
208
+ with open(os.path.join(dir, "meta.json")) as f:
209
+ meta_data = json.load(f)
210
+
211
+ bbox_chrome = meta_data["chrome"]["bounding_box"]
212
+ bbox_gray = meta_data["gray"]["bounding_box"]
213
+
214
+ mask_chrome = os.path.join(dir, "mask_chrome.png")
215
+ mask_gray = os.path.join(dir, "mask_gray.png")
216
+
217
+ scene_name = Path(dir).name
218
+ data[scene_name] = {
219
+ "scenes": scenes,
220
+ "chrome_probes": chrome_probes,
221
+ "gray_probes": gray_probes,
222
+ "bbox_chrome": bbox_chrome,
223
+ "bbox_gray": bbox_gray,
224
+ "mask_chrome": mask_chrome,
225
+ "mask_gray": mask_gray,
226
+ }
227
+ return data
228
+
229
+ def _mask_probe(self, image, mask):
230
+ """
231
+ mask probe with a png file in dataset
232
+ """
233
+ image_anticheat = skimage.img_as_float(np.array(image))
234
+ mask_np = skimage.img_as_float(np.array(mask))[..., None]
235
+ image_anticheat = ((1.0 - mask_np) * image_anticheat) + (0.5 * mask_np)
236
+ image_anticheat = Image.fromarray(skimage.img_as_ubyte(image_anticheat))
237
+ return image_anticheat
238
+
239
+ def _mask_boundingbox(self, image, bbox):
240
+ """
241
+ mask image with the bounding box for anti-cheat
242
+ """
243
+ bbox = {k:int(np.round(v/4.0)) for k,v in bbox.items()}
244
+ x, y, w, h = bbox["x"], bbox["y"], bbox["w"], bbox["h"]
245
+ image_anticheat = skimage.img_as_float(np.array(image))
246
+ image_anticheat[y:y+h, x:x+w] = 0.5
247
+ image_anticheat = Image.fromarray(skimage.img_as_ubyte(image_anticheat))
248
+ return image_anticheat
249
+
250
+ def __getitem__(self, scene_name):
251
+ data = self.scene_data[scene_name]
252
+
253
+ mask_chrome = Image.open(data["mask_chrome"])
254
+ mask_gray = Image.open(data["mask_gray"])
255
+ images = []
256
+ for path in data["scenes"]:
257
+ image = Image.open(path)
258
+ if self.mask_probe:
259
+ image = self._mask_probe(image, mask_chrome)
260
+ image = self._mask_probe(image, mask_gray)
261
+ if self.mask_boundingbox:
262
+ image = self._mask_boundingbox(image, data["bbox_chrome"])
263
+ image = self._mask_boundingbox(image, data["bbox_gray"])
264
+
265
+ if self.force_square:
266
+ image = pil_square_image(image, self.resolution)
267
+ else:
268
+ image = image.resize(self.resolution)
269
+ images.append(image)
270
+
271
+ chrome_probes = [Image.open(path) for path in data["chrome_probes"]]
272
+ gray_probes = [Image.open(path) for path in data["gray_probes"]]
273
+ bbox_chrome = data["bbox_chrome"]
274
+ bbox_gray = data["bbox_gray"]
275
+
276
+ return images, chrome_probes, gray_probes, bbox_chrome, bbox_gray
277
+
278
+
279
+ def calculate_ball_info(self, scene_name):
280
+ # TODO: remove hard-coded parameters
281
+ ball_data = []
282
+ for mtype in ['bbox_chrome', 'bbox_gray']:
283
+ info = self.scene_data[scene_name][mtype]
284
+
285
+ # x-y is top-left corner of the bounding box
286
+ # meta file is for 4000x6000 image but dataset is 1000x1500
287
+ x = info['x'] / 4
288
+ y = info['y'] / 4
289
+ w = info['w'] / 4
290
+ h = info['h'] / 4
291
+
292
+
293
+ # we scale data to 512x512 image
294
+ if self.force_square:
295
+ h_ratio = (512.0 * 2.0 / 3.0) / 1000.0 #384 because we have black border on the top
296
+ w_ratio = 512.0 / 1500.0
297
+ else:
298
+ h_ratio = self.resolution[0] / 1000.0
299
+ w_ratio = self.resolution[1] / 1500.0
300
+
301
+ x = x * w_ratio
302
+ y = y * h_ratio
303
+ w = w * w_ratio
304
+ h = h * h_ratio
305
+
306
+ if self.force_square:
307
+ # y need to shift due to top black border
308
+ top_border_height = 512.0 * (1/6)
309
+ y = y + top_border_height
310
+
311
+
312
+ # Sphere is not circle due to the camera perspective, Need future fix for this
313
+ # For now, we use the minimum of width and height
314
+ w = int(np.round(w))
315
+ h = int(np.round(h))
316
+ if w > h:
317
+ r = h
318
+ x = x + (w - h) / 2.0
319
+ else:
320
+ r = w
321
+ y = y + (h - w) / 2.0
322
+
323
+ x = int(np.round(x))
324
+ y = int(np.round(y))
325
+
326
+ ball_data.append((x, y, r))
327
+
328
+ return ball_data
329
+
330
+ def calculate_bbox_info(self, scene_name):
331
+ # TODO: remove hard-coded parameters
332
+ bbox_data = []
333
+ for mtype in ['bbox_chrome', 'bbox_gray']:
334
+ info = self.scene_data[scene_name][mtype]
335
+
336
+ # x-y is top-left corner of the bounding box
337
+ # meta file is for 4000x6000 image but dataset is 1000x1500
338
+ x = info['x'] / 4
339
+ y = info['y'] / 4
340
+ w = info['w'] / 4
341
+ h = info['h'] / 4
342
+
343
+
344
+ # we scale data to 512x512 image
345
+ if self.force_square:
346
+ h_ratio = (512.0 * 2.0 / 3.0) / 1000.0 #384 because we have black border on the top
347
+ w_ratio = 512.0 / 1500.0
348
+ else:
349
+ h_ratio = self.resolution[0] / 1000.0
350
+ w_ratio = self.resolution[1] / 1500.0
351
+
352
+ x = x * w_ratio
353
+ y = y * h_ratio
354
+ w = w * w_ratio
355
+ h = h * h_ratio
356
+
357
+ if self.force_square:
358
+ # y need to shift due to top black border
359
+ top_border_height = 512.0 * (1/6)
360
+ y = y + top_border_height
361
+
362
+
363
+ w = int(np.round(w))
364
+ h = int(np.round(h))
365
+ x = int(np.round(x))
366
+ y = int(np.round(y))
367
+
368
+ bbox_data.append((x, y, w, h))
369
+
370
+ return bbox_data
371
+
372
+ """
373
+ DO NOT remove this!
374
+ This is for evaluating results from Multi-Illumination generated from the old version
375
+ """
376
+ def calculate_ball_info_legacy(self, scene_name):
377
+ # TODO: remove hard-coded parameters
378
+ ball_data = []
379
+ for mtype in ['bbox_chrome', 'bbox_gray']:
380
+ info = self.scene_data[scene_name][mtype]
381
+
382
+ # x-y is top-left corner of the bounding box
383
+ # meta file is for 4000x6000 image but dataset is 1000x1500
384
+ x = info['x'] / 4
385
+ y = info['y'] / 4
386
+ w = info['w'] / 4
387
+ h = info['h'] / 4
388
+
389
+ # we scale data to 512x512 image
390
+ h_ratio = 384.0 / 1000.0 #384 because we have black border on the top
391
+ w_ratio = 512.0 / 1500.0
392
+ x = x * w_ratio
393
+ y = y * h_ratio
394
+ w = w * w_ratio
395
+ h = h * h_ratio
396
+
397
+ # y need to shift due to top black border
398
+ top_border_height = 512.0 * (1/8)
399
+
400
+ y = y + top_border_height
401
+
402
+ # Sphere is not circle due to the camera perspective, Need future fix for this
403
+ # For now, we use the minimum of width and height
404
+ r = np.max(np.array([w, h]))
405
+
406
+ x = int(np.round(x))
407
+ y = int(np.round(y))
408
+ r = int(np.round(r))
409
+
410
+ ball_data.append((y, x, r))
411
+
412
+ return ball_data
relighting/dist_utils.py ADDED
@@ -0,0 +1,154 @@
1
+ """
2
+ Helpers for distributed training.
3
+ """
4
+
5
+ import io
6
+ import os
7
+ import socket
8
+
9
+ try:
10
+ import blobfile as bf
11
+ except:
12
+ pass
13
+
14
+ try:
15
+ from mpi4py import MPI
16
+ except:
17
+ pass
18
+
19
+ import torch as th
20
+ import torch.distributed as dist
21
+ import builtins
22
+ import datetime
23
+
24
+ # Change this to reflect your cluster layout.
25
+ # The GPU for a given rank is (rank % GPUS_PER_NODE).
26
+ GPUS_PER_NODE = 8
27
+
28
+ SETUP_RETRY_COUNT = 3
29
+ def synchronize():
30
+ if not dist.is_available():
31
+ return
32
+
33
+ if not dist.is_initialized():
34
+ return
35
+
36
+ world_size = dist.get_world_size()
37
+
38
+ if world_size == 1:
39
+ return
40
+
41
+ dist.barrier()
42
+
43
+ def is_dist_avail_and_initialized():
44
+ if not dist.is_available():
45
+ return False
46
+ if not dist.is_initialized():
47
+ return False
48
+ return True
49
+ def get_world_size():
50
+ if not is_dist_avail_and_initialized():
51
+ return 1
52
+ return dist.get_world_size()
53
+
54
+ def setup_for_distributed(is_master):
55
+ """
56
+ This function disables printing when not in master process
57
+ """
58
+ builtin_print = builtins.print
59
+
60
+ def print(*args, **kwargs):
61
+ force = kwargs.pop('force', False)
62
+ force = force or (get_world_size() > 8)
63
+ if is_master or force:
64
+ now = datetime.datetime.now().time()
65
+ builtin_print('[{}] '.format(now), end='') # print with time stamp
66
+ builtin_print(*args, **kwargs)
67
+
68
+ builtins.print = print
69
+
70
+ def setup_dist_multinode(args):
71
+ """
72
+ Setup a distributed process group.
73
+ """
74
+ if not dist.is_available() or not dist.is_initialized():
75
+ th.distributed.init_process_group(backend="nccl", init_method='env://')
76
+ world_size = dist.get_world_size()
77
+ local_rank = int(os.getenv('LOCAL_RANK'))
78
+ print("rank",local_rank)
79
+ device = local_rank
80
+ th.cuda.set_device(device)
81
+ setup_for_distributed(device == 0)
82
+
83
+ synchronize()
84
+ else:
85
+ print("ddp failed!")
86
+ exit()
87
+
88
+ def setup_dist(global_seed):
89
+ """
90
+ Setup a distributed process group.
91
+ """
92
+ if dist.is_initialized():
93
+ return
94
+ th.cuda.set_device(int(os.environ["LOCAL_RANK"]))
95
+ th.distributed.init_process_group(backend="nccl", init_method="env://", timeout=datetime.timedelta(seconds=5400))
96
+
97
+ # fix seed
98
+ rank = dist.get_rank()
99
+ device = rank % th.cuda.device_count()
100
+ seed = global_seed * dist.get_world_size() + rank
101
+ th.manual_seed(seed)
102
+ th.cuda.set_device(device)
103
+ print(f"Starting rank={rank}, seed={seed}, world_size={dist.get_world_size()}.")
104
+ synchronize()
105
+
106
+ def dev():
107
+ """
108
+ Get the device to use for torch.distributed.
109
+ """
110
+ if th.cuda.is_available():
111
+ return th.device("cuda")
112
+ return th.device("cpu")
113
+
114
+
115
+ def load_state_dict(path, **kwargs):
116
+ """
117
+ Load a PyTorch file without redundant fetches across MPI ranks.
118
+ """
119
+ chunk_size = 2 ** 30 # MPI has a relatively small size limit
120
+ if MPI.COMM_WORLD.Get_rank() == 0:
121
+ with bf.BlobFile(path, "rb") as f:
122
+ data = f.read()
123
+ num_chunks = len(data) // chunk_size
124
+ if len(data) % chunk_size:
125
+ num_chunks += 1
126
+ MPI.COMM_WORLD.bcast(num_chunks)
127
+ for i in range(0, len(data), chunk_size):
128
+ MPI.COMM_WORLD.bcast(data[i : i + chunk_size])
129
+ else:
130
+ num_chunks = MPI.COMM_WORLD.bcast(None)
131
+ data = bytes()
132
+ for _ in range(num_chunks):
133
+ data += MPI.COMM_WORLD.bcast(None)
134
+
135
+ return th.load(io.BytesIO(data), **kwargs)
136
+
137
+
138
+ def sync_params(params):
139
+ """
140
+ Synchronize a sequence of Tensors across ranks from rank 0.
141
+ """
142
+ for p in params:
143
+ with th.no_grad():
144
+ dist.broadcast(p, 0)
145
+
146
+
147
+ def _find_free_port():
148
+ try:
149
+ s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
150
+ s.bind(("", 0))
151
+ s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
152
+ return s.getsockname()[1]
153
+ finally:
154
+ s.close()
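A hedged usage sketch for dist_utils (not part of the upload): `setup_dist` relies on torchrun-style environment variables (RANK, LOCAL_RANK, WORLD_SIZE), e.g. a launch like `torchrun --nproc_per_node=8 train.py` with something along these lines inside the script:

```python
# Illustrative wiring of the helpers above under torchrun; train.py is hypothetical.
import torch.distributed as dist
from relighting import dist_utils

def main():
    dist_utils.setup_dist(global_seed=0)  # init NCCL process group, per-rank seed
    device = dist_utils.dev()             # cuda if available, else cpu
    # ... build model / dataloader on `device` ...
    dist_utils.synchronize()              # barrier across ranks
    if dist.get_rank() == 0:
        print("all ranks ready")

if __name__ == "__main__":
    main()
```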
relighting/image_processor.py ADDED
@@ -0,0 +1,141 @@
1
+ import torch
2
+ import numpy as np
3
+ from PIL import Image, ImageChops
4
+ import skimage
5
+ try:
6
+ import cv2
7
+ except:
8
+ pass
9
+
10
+ def fill_image(image, mask_ball, x, y, size, color=(255,255,255)):
11
+ if isinstance(image, Image.Image):
12
+ result = np.array(image)
13
+ else:
14
+ result = image.copy()
15
+
16
+ result[y:y+size, x:x+size][mask_ball] = color
17
+
18
+ if isinstance(image, Image.Image):
19
+ result = Image.fromarray(result)
20
+
21
+ return result
22
+
23
+ def pil_square_image(image, desired_size = (512,512), interpolation=Image.LANCZOS):
24
+ """
25
+ Make top-bottom border
26
+ """
27
+ # Don't resize if already desired size (Avoid aliasing problem)
28
+ if image.size == desired_size:
29
+ return image
30
+
31
+ # Calculate the scale factor
32
+ scale_factor = min(desired_size[0] / image.width, desired_size[1] / image.height)
33
+
34
+ # Resize the image
35
+ resized_image = image.resize((int(image.width * scale_factor), int(image.height * scale_factor)), interpolation)
36
+
37
+ # Create a new blank image with the desired size and black border
38
+ new_image = Image.new("RGB", desired_size, color=(0, 0, 0))
39
+
40
+ # Paste the resized image onto the new image, centered
41
+ new_image.paste(resized_image, ((desired_size[0] - resized_image.width) // 2, (desired_size[1] - resized_image.height) // 2))
42
+
43
+ return new_image
44
+
45
+ # https://stackoverflow.com/questions/19271692/removing-borders-from-an-image-in-python
46
+ def remove_borders(image):
47
+ bg = Image.new(image.mode, image.size, image.getpixel((0,0)))
48
+ diff = ImageChops.difference(image, bg)
49
+ diff = ImageChops.add(diff, diff, 2.0, -100)
50
+ bbox = diff.getbbox()
51
+ if bbox:
52
+ return image.crop(bbox)
53
+
54
+ # Taken from https://huggingface.co/lllyasviel/sd-controlnet-normal
55
+ def estimate_scene_normal(image, depth_estimator):
56
+ # speed could be improved by avoiding the repeated conversion between numpy and torch
57
+ normal_image = depth_estimator(image)['predicted_depth'][0]
58
+
59
+ normal_image = normal_image.numpy()
60
+
61
+ # upsizing image depth to match input
62
+ hw = np.array(image).shape[:2]
63
+ normal_image = skimage.transform.resize(normal_image, hw, preserve_range=True)
64
+
65
+ image_depth = normal_image.copy()
66
+ image_depth -= np.min(image_depth)
67
+ image_depth /= np.max(image_depth)
68
+
69
+ bg_threshold = 0.4
70
+
71
+ x = cv2.Sobel(normal_image, cv2.CV_32F, 1, 0, ksize=3)
72
+ x[image_depth < bg_threshold] = 0
73
+
74
+ y = cv2.Sobel(normal_image, cv2.CV_32F, 0, 1, ksize=3)
75
+ y[image_depth < bg_threshold] = 0
76
+
77
+ z = np.ones_like(x) * np.pi * 2.0
78
+
79
+ normal_image = np.stack([x, y, z], axis=2)
80
+ normal_image /= np.sum(normal_image ** 2.0, axis=2, keepdims=True) ** 0.5
81
+
82
+ # rescale back to image size
83
+ return normal_image
84
+
85
+ def estimate_scene_depth(image, depth_estimator):
86
+ #image = feature_extractor(images=image, return_tensors="pt").pixel_values.to("cuda")
87
+ #with torch.no_grad(), torch.autocast("cuda"):
88
+ # depth_map = depth_estimator(image).predicted_depth
89
+
90
+ depth_map = depth_estimator(image)['predicted_depth']
91
+ W, H = image.size
92
+ depth_map = torch.nn.functional.interpolate(
93
+ depth_map.unsqueeze(1),
94
+ size=(H, W),
95
+ mode="bicubic",
96
+ align_corners=False,
97
+ )
98
+ depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True)
99
+ depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True)
100
+ depth_map = (depth_map - depth_min) / (depth_max - depth_min)
101
+ image = torch.cat([depth_map] * 3, dim=1)
102
+
103
+ image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
104
+ image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8))
105
+ return image
106
+
107
+ def fill_depth_circular(depth_image, x, y, r):
108
+ depth_image = np.array(depth_image)
109
+
110
+ for i in range(depth_image.shape[0]):
111
+ for j in range(depth_image.shape[1]):
112
+ xy = (i - x - r//2)**2 + (j - y - r//2)**2
113
+ # if xy <= rr**2:
114
+ # depth_image[j, i, :] = 255
115
+ # depth_image[j, i, :] = int(minv + (maxv - minv) * z)
116
+ if xy <= (r // 2)**2:
117
+ depth_image[j, i, :] = 255
118
+
119
+ depth_image = Image.fromarray(depth_image)
120
+ return depth_image
121
+
122
+
123
+ def merge_normal_map(normal_map, normal_ball, mask_ball, x, y):
124
+ """
125
+ Merge a ball to normal map using mask
126
+ @params
127
+ normal_map (np.array) - normal map of the scene [height, width, 3]
128
+ normal_ball (np.array) - normal map of the ball [ball_height, ball_width, 3]
129
+ mask_ball (np.array) - mask of the ball [ball_height, ball_width]
130
+ x (int) - x position of the ball (top-left)
131
+ y (int) - y position of the ball (top-left)
132
+ @return
133
+ normal_map (np.array) - the merged normal map [height, width, 3]
134
+ """
135
+ result = normal_map.copy()
136
+
137
+ mask_ball = mask_ball[..., None]
138
+ ball = (normal_ball * mask_ball) # alpha blending the ball
139
+ unball = (normal_map[y:y+normal_ball.shape[0], x:x+normal_ball.shape[1]] * (1 - mask_ball)) # alpha blending the normal map
140
+ result[y:y+normal_ball.shape[0], x:x+normal_ball.shape[1]] = ball+unball # add them together
141
+ return result
relighting/inpainter.py ADDED
@@ -0,0 +1,424 @@
1
+ import torch
2
+ from diffusers import ControlNetModel, AutoencoderKL
3
+ from PIL import Image
4
+ import numpy as np
5
+ import os
6
+ from tqdm.auto import tqdm
7
+ from transformers import pipeline as transformers_pipeline
8
+
9
+ from relighting.pipeline import CustomStableDiffusionControlNetInpaintPipeline
10
+ from relighting.pipeline_inpaintonly import CustomStableDiffusionInpaintPipeline, CustomStableDiffusionXLInpaintPipeline
11
+ from relighting.argument import SAMPLERS, VAE_MODELS, DEPTH_ESTIMATOR, get_control_signal_type
12
+ from relighting.image_processor import (
13
+ estimate_scene_depth,
14
+ estimate_scene_normal,
15
+ merge_normal_map,
16
+ fill_depth_circular
17
+ )
18
+ from relighting.ball_processor import get_ideal_normal_ball, crop_ball
19
+ import pickle
20
+
21
+ from relighting.pipeline_xl import CustomStableDiffusionXLControlNetInpaintPipeline
22
+
23
+ class NoWaterMark:
24
+ def apply_watermark(self, *args, **kwargs):
25
+ return args[0]
26
+
27
+ class ControlSignalGenerator():
28
+ def __init__(self, sd_arch, control_signal_type, device):
29
+ self.sd_arch = sd_arch
30
+ self.control_signal_type = control_signal_type
31
+ self.device = device
32
+
33
+ def process_sd_depth(self, input_image, normal_ball=None, mask_ball=None, x=None, y=None, r=None):
34
+ if getattr(self, 'depth_estimator', None) is None:
35
+ self.depth_estimator = transformers_pipeline("depth-estimation", device=self.device.index)
36
+
37
+ control_image = self.depth_estimator(input_image)['depth']
38
+ control_image = np.array(control_image)
39
+ control_image = control_image[:, :, None]
40
+ control_image = np.concatenate([control_image, control_image, control_image], axis=2)
41
+ control_image = Image.fromarray(control_image)
42
+
43
+ control_image = fill_depth_circular(control_image, x, y, r)
44
+ return control_image
45
+
46
+ def process_sdxl_depth(self, input_image, normal_ball=None, mask_ball=None, x=None, y=None, r=None):
47
+ if getattr(self, 'depth_estimator', None) is None:
48
+ self.depth_estimator = transformers_pipeline("depth-estimation", model=DEPTH_ESTIMATOR, device=self.device.index)
49
+
50
+ control_image = estimate_scene_depth(input_image, depth_estimator=self.depth_estimator)
51
+ xs = [x] if not isinstance(x, list) else x
52
+ ys = [y] if not isinstance(y, list) else y
53
+ rs = [r] if not isinstance(r, list) else r
54
+
55
+ for x, y, r in zip(xs, ys, rs):
56
+ #print(f"depth at {x}, {y}, {r}")
57
+ control_image = fill_depth_circular(control_image, x, y, r)
58
+ return control_image
59
+
60
+ def process_sd_normal(self, input_image, normal_ball, mask_ball, x, y, r=None, normal_ball_path=None):
61
+ if getattr(self, 'depth_estimator', None) is None:
62
+ self.depth_estimator = transformers_pipeline("depth-estimation", model=DEPTH_ESTIMATOR, device=self.device.index)
63
+
64
+ normal_scene = estimate_scene_normal(input_image, depth_estimator=self.depth_estimator)
65
+ normal_image = merge_normal_map(normal_scene, normal_ball, mask_ball, x, y)
66
+ normal_image = (normal_image * 127.5 + 127.5).clip(0, 255).astype(np.uint8)
67
+ control_image = Image.fromarray(normal_image)
68
+ return control_image
69
+
70
+ def __call__(self, *args, **kwargs):
71
+ process_fn = getattr(self, f"process_{self.sd_arch}_{self.control_signal_type}", None)
72
+ if process_fn is None:
73
+ raise ValueError
74
+ else:
75
+ return process_fn(*args, **kwargs)
76
+
77
+
78
+ class BallInpainter():
79
+ def __init__(self, pipeline, sd_arch, control_generator, disable_water_mask=True):
80
+ self.pipeline = pipeline
81
+ self.sd_arch = sd_arch
82
+ self.control_generator = control_generator
83
+ self.median = {}
84
+ if disable_water_mask:
85
+ self._disable_water_mask()
86
+
87
+ def _disable_water_mask(self):
88
+ if hasattr(self.pipeline, "watermark"):
89
+ self.pipeline.watermark = NoWaterMark()
90
+ print("Disabled watermasking")
91
+
92
+ @classmethod
93
+ def from_sd(cls,
94
+ model,
95
+ controlnet=None,
96
+ device=0,
97
+ sampler="unipc",
98
+ torch_dtype=torch.float16,
99
+ disable_water_mask=True,
100
+ offload=False
101
+ ):
102
+ if controlnet is not None:
103
+ control_signal_type = get_control_signal_type(controlnet)
104
+ controlnet = ControlNetModel.from_pretrained(controlnet, torch_dtype=torch.float16)
105
+ pipe = CustomStableDiffusionControlNetInpaintPipeline.from_pretrained(
106
+ model,
107
+ controlnet=controlnet,
108
+ torch_dtype=torch_dtype,
109
+ ).to(device)
110
+ control_generator = ControlSignalGenerator("sd", control_signal_type, device=device)
111
+ else:
112
+ pipe = CustomStableDiffusionInpaintPipeline.from_pretrained(
113
+ model,
114
+ torch_dtype=torch_dtype,
115
+ ).to(device)
116
+ control_generator = None
117
+
118
+ try:
119
+ if torch_dtype==torch.float16 and device != torch.device("cpu"):
120
+ pipe.enable_xformers_memory_efficient_attention()
121
+ except:
122
+ pass
123
+ pipe.set_progress_bar_config(disable=True)
124
+
125
+ pipe.scheduler = SAMPLERS[sampler].from_config(pipe.scheduler.config)
126
+
127
+ return BallInpainter(pipe, "sd", control_generator, disable_water_mask)
128
+
129
+ @classmethod
130
+ def from_sdxl(cls,
131
+ model,
132
+ controlnet=None,
133
+ device=0,
134
+ sampler="unipc",
135
+ torch_dtype=torch.float16,
136
+ disable_water_mask=True,
137
+ use_fixed_vae=True,
138
+ offload=False
139
+ ):
140
+ vae = VAE_MODELS["sdxl"]
141
+ vae = AutoencoderKL.from_pretrained(vae, torch_dtype=torch_dtype).to(device) if use_fixed_vae else None
142
+ extra_kwargs = {"vae": vae} if vae is not None else {}
143
+
144
+ if controlnet is not None:
145
+ control_signal_type = get_control_signal_type(controlnet)
146
+ controlnet = ControlNetModel.from_pretrained(
147
+ controlnet,
148
+ variant="fp16" if torch_dtype == torch.float16 else None,
149
+ use_safetensors=True,
150
+ torch_dtype=torch_dtype,
151
+ ).to(device)
152
+ pipe = CustomStableDiffusionXLControlNetInpaintPipeline.from_pretrained(
153
+ model,
154
+ controlnet=controlnet,
155
+ variant="fp16" if torch_dtype == torch.float16 else None,
156
+ use_safetensors=True,
157
+ torch_dtype=torch_dtype,
158
+ **extra_kwargs,
159
+ ).to(device)
160
+ control_generator = ControlSignalGenerator("sdxl", control_signal_type, device=device)
161
+ else:
162
+ pipe = CustomStableDiffusionXLInpaintPipeline.from_pretrained(
163
+ model,
164
+ variant="fp16" if torch_dtype == torch.float16 else None,
165
+ use_safetensors=True,
166
+ torch_dtype=torch_dtype,
167
+ **extra_kwargs,
168
+ ).to(device)
169
+ control_generator = None
170
+
171
+ try:
172
+ if torch_dtype==torch.float16 and device != torch.device("cpu"):
173
+ pipe.enable_xformers_memory_efficient_attention()
174
+ except:
175
+ pass
176
+
177
+ if offload and device != torch.device("cpu"):
178
+ pipe.enable_model_cpu_offload()
179
+ pipe.set_progress_bar_config(disable=True)
180
+ pipe.scheduler = SAMPLERS[sampler].from_config(pipe.scheduler.config)
181
+
182
+ return BallInpainter(pipe, "sdxl", control_generator, disable_water_mask)
183
+
184
+ # TODO: this method should be replaced by inpaint(), but we'll leave it here for now
185
+ # otherwise, the existing experiment code would break
186
+ def __call__(self, *args, **kwargs):
187
+ return self.pipeline(*args, **kwargs)
188
+
189
+ def _default_height_width(self, height=None, width=None):
190
+ if (height is not None) and (width is not None):
191
+ return height, width
192
+ if self.sd_arch == "sd":
193
+ return (512, 512)
194
+ elif self.sd_arch == "sdxl":
195
+ return (1024, 1024)
196
+ else:
197
+ raise NotImplementedError
198
+
199
+ # this method is for sanity check only
200
+ def get_cache_control_image(self):
201
+ control_image = getattr(self, "cache_control_image", None)
202
+ return control_image
203
+
204
+ def prepare_control_signal(self, image, controlnet_conditioning_scale, extra_kwargs):
205
+ if self.control_generator is not None:
206
+ control_image = self.control_generator(image, **extra_kwargs)
207
+ controlnet_kwargs = {
208
+ "control_image": control_image,
209
+ "controlnet_conditioning_scale": controlnet_conditioning_scale
210
+ }
211
+ self.cache_control_image = control_image
212
+ else:
213
+ controlnet_kwargs = {}
214
+
215
+ return controlnet_kwargs
216
+
217
+ def get_cache_median(self, it):
218
+ if it in self.median: return self.median[it]
219
+ else: return None
220
+
221
+ def reset_median(self):
222
+ self.median = {}
223
+ print("Reset median")
224
+
225
+ def load_median(self, path):
226
+ if os.path.exists(path):
227
+ with open(path, "rb") as f:
228
+ self.median = pickle.load(f)
229
+ print(f"Loaded median from {path}")
230
+ else:
231
+ print(f"Median not found at {path}!")
232
+
233
+ def inpaint_iterative(
234
+ self,
235
+ prompt=None,
236
+ negative_prompt="",
237
+ num_inference_steps=30,
238
+ generator=None, # TODO: remove this
239
+ image=None,
240
+ mask_image=None,
241
+ height=None,
242
+ width=None,
243
+ controlnet_conditioning_scale=0.5,
244
+ num_images_per_prompt=1,
245
+ current_seed=0,
246
+ cross_attention_kwargs={},
247
+ strength=0.8,
248
+ num_iteration=2,
249
+ ball_per_iteration=30,
250
+ agg_mode="median",
251
+ save_intermediate=True,
252
+ cache_dir="./temp_inpaint_iterative",
253
+ disable_progress=False,
254
+ prompt_embeds=None,
255
+ pooled_prompt_embeds=None,
256
+ use_cache_median=False,
257
+ guidance_scale=5.0, # In the paper, we set the guidance scale to 5.0 (same as pipeline_xl.py)
258
+ **extra_kwargs,
259
+ ):
260
+ def computeMedian(ball_images):
261
+ all = np.stack(ball_images, axis=0)
262
+ median = np.median(all, axis=0)
263
+ idx_median = np.argsort(all, axis=0)[all.shape[0]//2]
264
+ # print(all.shape)
265
+ # print(idx_median.shape)
266
+ return median, idx_median
267
+
268
+ def generate_balls(avg_image, current_strength, ball_per_iteration, current_iteration):
269
+ print(f"Inpainting balls for {current_iteration} iteration...")
270
+ controlnet_kwargs = self.prepare_control_signal(
271
+ image=avg_image,
272
+ controlnet_conditioning_scale=controlnet_conditioning_scale,
273
+ extra_kwargs=extra_kwargs,
274
+ )
275
+
276
+ ball_images = []
277
+ for i in tqdm(range(ball_per_iteration), disable=disable_progress):
278
+ seed = current_seed + i
279
+ new_generator = torch.Generator().manual_seed(seed)
280
+ output_image = self.pipeline(
281
+ prompt=prompt,
282
+ negative_prompt=negative_prompt,
283
+ num_inference_steps=num_inference_steps,
284
+ generator=new_generator,
285
+ image=avg_image,
286
+ mask_image=mask_image,
287
+ height=height,
288
+ width=width,
289
+ num_images_per_prompt=num_images_per_prompt,
290
+ strength=current_strength,
291
+ newx=x,
292
+ newy=y,
293
+ newr=r,
294
+ current_seed=seed,
295
+ cross_attention_kwargs=cross_attention_kwargs,
296
+ prompt_embeds=prompt_embeds,
297
+ pooled_prompt_embeds=pooled_prompt_embeds,
298
+ guidance_scale=guidance_scale,
299
+ **controlnet_kwargs
300
+ ).images[0]
301
+
302
+ ball_image = crop_ball(output_image, mask_ball_for_crop, x, y, r)
303
+ ball_images.append(ball_image)
304
+
305
+ if save_intermediate:
306
+ os.makedirs(os.path.join(cache_dir, str(current_iteration)), mode=0o777, exist_ok=True)
307
+ output_image.save(os.path.join(cache_dir, str(current_iteration), f"raw_{i}.png"))
308
+ Image.fromarray(ball_image).save(os.path.join(cache_dir, str(current_iteration), f"ball_{i}.png"))
309
+ # chmod 777
310
+ os.chmod(os.path.join(cache_dir, str(current_iteration), f"raw_{i}.png"), 0o0777)
311
+ os.chmod(os.path.join(cache_dir, str(current_iteration), f"ball_{i}.png"), 0o0777)
312
+
313
+
314
+ return ball_images
315
+
316
+ if save_intermediate:
317
+ os.makedirs(cache_dir, exist_ok=True)
318
+
319
+ height, width = self._default_height_width(height, width)
320
+
321
+ x = extra_kwargs["x"]
322
+ y = extra_kwargs["y"]
323
+ r = 256 if "r" not in extra_kwargs else extra_kwargs["r"]
324
+ _, mask_ball_for_crop = get_ideal_normal_ball(size=r)
325
+
326
+ # generate initial average ball
327
+ avg_image = image
328
+ ball_images = generate_balls(
329
+ avg_image,
330
+ current_strength=1.0,
331
+ ball_per_iteration=ball_per_iteration,
332
+ current_iteration=0,
333
+ )
334
+
335
+ # ball refinement loop
336
+ image = np.array(image)
337
+ for it in range(1, num_iteration+1):
338
+ if use_cache_median and (self.get_cache_median(it) is not None):
339
+ print("Use existing median")
340
+ all = np.stack(ball_images, axis=0)
341
+ idx_median = self.get_cache_median(it)
342
+ avg_ball = all[idx_median,
343
+ np.arange(idx_median.shape[0])[:, np.newaxis, np.newaxis],
344
+ np.arange(idx_median.shape[1])[np.newaxis, :, np.newaxis],
345
+ np.arange(idx_median.shape[2])[np.newaxis, np.newaxis, :]
346
+ ]
347
+ else:
348
+ avg_ball, idx_median = computeMedian(ball_images)
349
+ print("Add new median")
350
+ self.median[it] = idx_median
351
+
352
+ avg_image = merge_normal_map(image, avg_ball, mask_ball_for_crop, x, y)
353
+ avg_image = Image.fromarray(avg_image.astype(np.uint8))
354
+ if save_intermediate:
355
+ avg_image.save(os.path.join(cache_dir, f"average_{it}.png"))
356
+ # chmod777
357
+ os.chmod(os.path.join(cache_dir, f"average_{it}.png"), 0o0777)
358
+
359
+ ball_images = generate_balls(
360
+ avg_image,
361
+ current_strength=strength,
362
+ ball_per_iteration=ball_per_iteration if it < num_iteration else 1,
363
+ current_iteration=it,
364
+ )
365
+
366
+ # TODO: add an algorithm to select the best ball
367
+ best_ball = ball_images[0]
368
+ output_image = merge_normal_map(image, best_ball, mask_ball_for_crop, x, y)
369
+ return Image.fromarray(output_image.astype(np.uint8))
370
+
371
+ def inpaint(
372
+ self,
373
+ prompt=None,
374
+ negative_prompt=None,
375
+ num_inference_steps=30,
376
+ generator=None,
377
+ image=None,
378
+ mask_image=None,
379
+ height=None,
380
+ width=None,
381
+ controlnet_conditioning_scale=0.5,
382
+ num_images_per_prompt=1,
383
+ strength=1.0,
384
+ current_seed=0,
385
+ cross_attention_kwargs={},
386
+ prompt_embeds=None,
387
+ pooled_prompt_embeds=None,
388
+ guidance_scale=5.0, # (same as pipeline_xl.py)
389
+ **extra_kwargs,
390
+ ):
391
+ height, width = self._default_height_width(height, width)
392
+
393
+ controlnet_kwargs = self.prepare_control_signal(
394
+ image=image,
395
+ controlnet_conditioning_scale=controlnet_conditioning_scale,
396
+ extra_kwargs=extra_kwargs,
397
+ )
398
+
399
+ if generator is None:
400
+ generator = torch.Generator().manual_seed(0)
401
+
402
+ output_image = self.pipeline(
403
+ prompt=prompt,
404
+ negative_prompt=negative_prompt,
405
+ num_inference_steps=num_inference_steps,
406
+ generator=generator,
407
+ image=image,
408
+ mask_image=mask_image,
409
+ height=height,
410
+ width=width,
411
+ num_images_per_prompt=num_images_per_prompt,
412
+ strength=strength,
413
+ newx = extra_kwargs["x"],
414
+ newy = extra_kwargs["y"],
415
+ newr = extra_kwargs.get("r", 256), # default to ball_size = 256; extra_kwargs is a dict, so use .get() rather than getattr()
416
+ current_seed=current_seed,
417
+ cross_attention_kwargs=cross_attention_kwargs,
418
+ prompt_embeds=prompt_embeds,
419
+ pooled_prompt_embeds=pooled_prompt_embeds,
420
+ guidance_scale=guidance_scale,
421
+ **controlnet_kwargs
422
+ )
423
+
424
+ return output_image
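A hedged end-to-end sketch of the BallInpainter API above. The model and ControlNet identifiers, the prompt, and the placement are illustrative only (the project's own entry script wires these pieces together and adds the LoRA and exposure bracketing on top), and `get_control_signal_type` is assumed to recognize a depth ControlNet from its name:

```python
import torch
from PIL import Image
from relighting.inpainter import BallInpainter
from relighting.mask_utils import MaskGenerator
from relighting.ball_processor import get_ideal_normal_ball
from relighting.image_processor import pil_square_image

# Illustrative model choices; substitute whatever the entry script actually uses.
inpainter = BallInpainter.from_sdxl(
    model="stabilityai/stable-diffusion-xl-base-1.0",
    controlnet="diffusers/controlnet-depth-sdxl-1.0",
    device=torch.device("cuda", 0),
    torch_dtype=torch.float16,
)

image = pil_square_image(Image.open("input.png").convert("RGB"), (1024, 1024))
x = y = (1024 - 256) // 2                        # centered 256 px ball
_, mask_ball = get_ideal_normal_ball(size=256)
mask = MaskGenerator().generate_single(image, mask_ball, x, y, size=256)

result = inpainter.inpaint_iterative(
    prompt="a perfect mirrored reflective chrome ball sphere",  # illustrative prompt
    negative_prompt="matte, diffuse, flat, dull",
    image=image,
    mask_image=mask,
    x=x, y=y, r=256,
)
result.save("chrome_ball.png")
```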
relighting/mask_utils.py ADDED
@@ -0,0 +1,124 @@
1
+ try:
2
+ import cv2
3
+ except:
4
+ pass
5
+ import numpy as np
6
+ from PIL import Image
7
+ from relighting.ball_processor import get_ideal_normal_ball
8
+
9
+ def create_grid(image_size, n_ball, size):
10
+ height, width = image_size
11
+ nx, ny = n_ball
12
+ if nx * ny == 1:
13
+ grid = np.array([[(height-size)//2, (width-size)//2]])
14
+ else:
15
+ height_ = np.linspace(0, height-size, nx).astype(int)
16
+ width_ = np.linspace(0, width-size, ny).astype(int)
17
+ hh, ww = np.meshgrid(height_, width_)
18
+ grid = np.stack([hh,ww], axis = -1).reshape(-1,2)
19
+
20
+ return grid
21
+
22
+ class MaskGenerator():
23
+ def __init__(self, cache_mask=True):
24
+ self.cache_mask = cache_mask
25
+ self.all_masks = []
26
+
27
+ def clear_cache(self):
28
+ self.all_masks = []
29
+
30
+ def retrieve_masks(self):
31
+ return self.all_masks
32
+
33
+ def generate_grid(self, image, mask_ball, n_ball=16, size=128):
34
+ ball_positions = create_grid(image.size, n_ball, size)
35
+ # _, mask_ball = get_normal_ball(size)
36
+
37
+ masks = []
38
+ mask_template = np.zeros(image.size)
39
+ for x, y in ball_positions:
40
+ mask = mask_template.copy()
41
+ mask[y:y+size, x:x+size] = 255 * mask_ball
42
+ mask = Image.fromarray(mask.astype(np.uint8), "L")
43
+ masks.append(mask)
44
+
45
+ # if self.cache_mask:
46
+ # self.all_masks.append((x, y, size))
47
+
48
+ return masks, ball_positions
49
+
50
+ def generate_single(self, image, mask_ball, x, y, size):
51
+ w,h = image.size # numpy as (h,w) but PIL is (w,h)
52
+ mask = np.zeros((h,w))
53
+ mask[y:y+size, x:x+size] = 255 * mask_ball
54
+ mask = Image.fromarray(mask.astype(np.uint8), "L")
55
+
56
+ return mask
57
+
58
+ def generate_best(self, image, mask_ball, size):
59
+ w,h = image.size # numpy as (h,w) but PIL is (w,h)
60
+ mask = np.zeros((h,w))
61
+
62
+ (y, x), _ = find_best_location(np.array(image), ball_size=size)
63
+ mask[y:y+size, x:x+size] = 255 * mask_ball
64
+ mask = Image.fromarray(mask.astype(np.uint8), "L")
65
+
66
+ return mask, (x, y)
67
+
68
+
69
+ def get_only_high_freqency(image: np.array):
70
+ """
71
+ Get only height freqency image by subtract low freqency (using gaussian blur)
72
+ @params image: np.array - image in RGB format [h,w,3]
73
+ @return high_frequency: np.array - high freqnecy image in grayscale format [h,w]
74
+ """
75
+
76
+ # Convert to grayscale
77
+ gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
78
+
79
+ # Subtract the low-frequency (blurred) image to keep only the high-frequency content
80
+ kernel_size = 11 # Adjust this according to your image size
81
+ high_frequency = gray - cv2.GaussianBlur(gray,(kernel_size, kernel_size), 0)
82
+
83
+ return high_frequency
84
+
85
+ def find_best_location(image, ball_size=128):
86
+ """
87
+ Find the best location to place the ball (e.g., an empty, low-texture region)
88
+ @params image: np.array - image in RGB format [h,w,3]
89
+ @return min_pos: tuple - top left position of the best location (the location is in "Y,X" format)
90
+ @return min_val: float - the sum of values contained in the window
91
+ """
92
+ local_variance = get_only_high_freqency(image)
93
+ qsum = quicksum2d(local_variance)
94
+
95
+ min_val = None
96
+ min_pos = None
97
+ k = ball_size
98
+ for i in range(k-1, qsum.shape[0]):
99
+ for j in range(k-1, qsum.shape[1]):
100
+ A = 0 if i-k < 0 else qsum[i-k, j]
101
+ B = 0 if j-k < 0 else qsum[i, j-k]
102
+ C = 0 if (i-k < 0) or (j-k < 0) else qsum[i-k, j-k]
103
+ sum = qsum[i, j] - A - B + C
104
+ if (min_val is None) or (sum < min_val):
105
+ min_val = sum
106
+ min_pos = (i-k+1, j-k+1) # get top left position
107
+
108
+ return min_pos, min_val
109
+
110
+ def quicksum2d(x: np.array):
111
+ """
112
+ Build a 2D prefix sum (summed-area table) in O(n^2); find_best_location uses it to evaluate window sums in O(1)
113
+ @params x: np.array - image in grayscale [h,w]
114
+ @return qsum: np.array - prefix sum of the image, used for the window search in find_best_location [h,w]
115
+ """
116
+ qsum = np.zeros(x.shape)
117
+ for i in range(x.shape[0]):
118
+ for j in range(x.shape[1]):
119
+ A = 0 if i-1 < 0 else qsum[i-1, j]
120
+ B = 0 if j-1 < 0 else qsum[i, j-1]
121
+ C = 0 if (i-1 < 0) or (j-1 < 0) else qsum[i-1, j-1]
122
+ qsum[i, j] = A + B - C + x[i, j]
123
+
124
+ return qsum
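A quick sanity check (illustrative, random data) of the summed-area-table identity that find_best_location relies on: the sum over the k x k window ending at (i, j) equals `qsum[i, j] - A - B + C`:

```python
import numpy as np
from relighting.mask_utils import quicksum2d

rng = np.random.default_rng(0)
x = rng.random((32, 32))
qsum = quicksum2d(x)

k, i, j = 8, 20, 25
A, B, C = qsum[i - k, j], qsum[i, j - k], qsum[i - k, j - k]
window_sum = qsum[i, j] - A - B + C
naive_sum = x[i - k + 1:i + 1, j - k + 1:j + 1].sum()
assert np.isclose(window_sum, naive_sum)
```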
relighting/pipeline.py ADDED
@@ -0,0 +1,344 @@
1
+ import torch
2
+ from typing import List, Union, Dict, Any, Callable, Optional, Tuple
3
+
4
+ from diffusers.utils.torch_utils import randn_tensor, is_compiled_module
5
+ from diffusers.models import ControlNetModel
6
+ from diffusers.pipelines.controlnet import MultiControlNetModel
7
+ from diffusers import StableDiffusionControlNetInpaintPipeline
8
+ from diffusers.image_processor import PipelineImageInput
9
+ from diffusers.pipelines.stable_diffusion.pipeline_output import StableDiffusionPipelineOutput
10
+ from relighting.pipeline_utils import custom_prepare_latents, custom_prepare_mask_latents
11
+
12
+ class CustomStableDiffusionControlNetInpaintPipeline(StableDiffusionControlNetInpaintPipeline):
13
+ @torch.no_grad()
14
+ def __call__(
15
+ self,
16
+ prompt: Union[str, List[str]] = None,
17
+ image: PipelineImageInput = None,
18
+ mask_image: PipelineImageInput = None,
19
+ control_image: PipelineImageInput = None,
20
+ height: Optional[int] = None,
21
+ width: Optional[int] = None,
22
+ strength: float = 1.0,
23
+ num_inference_steps: int = 50,
24
+ guidance_scale: float = 7.5,
25
+ negative_prompt: Optional[Union[str, List[str]]] = None,
26
+ num_images_per_prompt: Optional[int] = 1,
27
+ eta: float = 0.0,
28
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
29
+ latents: Optional[torch.FloatTensor] = None,
30
+ prompt_embeds: Optional[torch.FloatTensor] = None,
31
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
32
+ output_type: Optional[str] = "pil",
33
+ return_dict: bool = True,
34
+ callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
35
+ callback_steps: int = 1,
36
+ cross_attention_kwargs: Optional[Dict[str, Any]] = None,
37
+ controlnet_conditioning_scale: Union[float, List[float]] = 0.5,
38
+ guess_mode: bool = False,
39
+ control_guidance_start: Union[float, List[float]] = 0.0,
40
+ control_guidance_end: Union[float, List[float]] = 1.0,
41
+ newx: int = 0,
42
+ newy: int = 0,
43
+ newr: int = 256,
44
+ current_seed=0,
45
+ use_noise_moving=True,
46
+ ):
47
+ # OVERWRITE METHODS
48
+ self.prepare_mask_latents = custom_prepare_mask_latents.__get__(self, CustomStableDiffusionControlNetInpaintPipeline)
49
+ self.prepare_latents = custom_prepare_latents.__get__(self, CustomStableDiffusionControlNetInpaintPipeline)
50
+
51
+ controlnet = self.controlnet._orig_mod if is_compiled_module(self.controlnet) else self.controlnet
52
+
53
+ # align format for control guidance
54
+ if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
55
+ control_guidance_start = len(control_guidance_end) * [control_guidance_start]
56
+ elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
57
+ control_guidance_end = len(control_guidance_start) * [control_guidance_end]
58
+ elif not isinstance(control_guidance_start, list) and not isinstance(control_guidance_end, list):
59
+ mult = len(controlnet.nets) if isinstance(controlnet, MultiControlNetModel) else 1
60
+ control_guidance_start, control_guidance_end = mult * [control_guidance_start], mult * [
61
+ control_guidance_end
62
+ ]
63
+
64
+ # 1. Check inputs. Raise error if not correct
65
+ self.check_inputs(
66
+ prompt,
67
+ control_image,
68
+ height,
69
+ width,
70
+ callback_steps,
71
+ negative_prompt,
72
+ prompt_embeds,
73
+ negative_prompt_embeds,
74
+ controlnet_conditioning_scale,
75
+ control_guidance_start,
76
+ control_guidance_end,
77
+ )
78
+
79
+ # 2. Define call parameters
80
+ if prompt is not None and isinstance(prompt, str):
81
+ batch_size = 1
82
+ elif prompt is not None and isinstance(prompt, list):
83
+ batch_size = len(prompt)
84
+ else:
85
+ batch_size = prompt_embeds.shape[0]
86
+
87
+ device = self._execution_device
88
+ # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
89
+ # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
90
+ # corresponds to doing no classifier free guidance.
91
+ do_classifier_free_guidance = guidance_scale > 1.0
92
+
93
+ if isinstance(controlnet, MultiControlNetModel) and isinstance(controlnet_conditioning_scale, float):
94
+ controlnet_conditioning_scale = [controlnet_conditioning_scale] * len(controlnet.nets)
95
+
96
+ global_pool_conditions = (
97
+ controlnet.config.global_pool_conditions
98
+ if isinstance(controlnet, ControlNetModel)
99
+ else controlnet.nets[0].config.global_pool_conditions
100
+ )
101
+ guess_mode = guess_mode or global_pool_conditions
102
+
103
+ # 3. Encode input prompt
104
+ text_encoder_lora_scale = (
105
+ cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
106
+ )
107
+ prompt_embeds, negative_prompt_embeds = self.encode_prompt(
108
+ prompt,
109
+ device,
110
+ num_images_per_prompt,
111
+ do_classifier_free_guidance,
112
+ negative_prompt,
113
+ prompt_embeds=prompt_embeds,
114
+ negative_prompt_embeds=negative_prompt_embeds,
115
+ lora_scale=text_encoder_lora_scale,
116
+ )
117
+ # For classifier free guidance, we need to do two forward passes.
118
+ # Here we concatenate the unconditional and text embeddings into a single batch
119
+ # to avoid doing two forward passes
120
+ if do_classifier_free_guidance:
121
+ prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
122
+
123
+ # 4. Prepare image
124
+ if isinstance(controlnet, ControlNetModel):
125
+ control_image = self.prepare_control_image(
126
+ image=control_image,
127
+ width=width,
128
+ height=height,
129
+ batch_size=batch_size * num_images_per_prompt,
130
+ num_images_per_prompt=num_images_per_prompt,
131
+ device=device,
132
+ dtype=controlnet.dtype,
133
+ do_classifier_free_guidance=do_classifier_free_guidance,
134
+ guess_mode=guess_mode,
135
+ )
136
+ elif isinstance(controlnet, MultiControlNetModel):
137
+ control_images = []
138
+
139
+ for control_image_ in control_image:
140
+ control_image_ = self.prepare_control_image(
141
+ image=control_image_,
142
+ width=width,
143
+ height=height,
144
+ batch_size=batch_size * num_images_per_prompt,
145
+ num_images_per_prompt=num_images_per_prompt,
146
+ device=device,
147
+ dtype=controlnet.dtype,
148
+ do_classifier_free_guidance=do_classifier_free_guidance,
149
+ guess_mode=guess_mode,
150
+ )
151
+
152
+ control_images.append(control_image_)
153
+
154
+ control_image = control_images
155
+ else:
156
+ assert False
157
+
158
+ # 4. Preprocess mask and image - resizes image and mask w.r.t height and width
159
+ init_image = self.image_processor.preprocess(image, height=height, width=width)
160
+ init_image = init_image.to(dtype=torch.float32)
161
+
162
+ mask = self.mask_processor.preprocess(mask_image, height=height, width=width)
163
+
164
+ masked_image = init_image * (mask < 0.5)
165
+ _, _, height, width = init_image.shape
166
+
167
+ # 5. Prepare timesteps
168
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
169
+ timesteps, num_inference_steps = self.get_timesteps(
170
+ num_inference_steps=num_inference_steps, strength=strength, device=device
171
+ )
172
+ # at which timestep to set the initial noise (n.b. 50% if strength is 0.5)
173
+ latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
174
+ # create a boolean to check if the strength is set to 1. if so then initialise the latents with pure noise
175
+ is_strength_max = strength == 1.0
176
+
177
+ # 6. Prepare latent variables
178
+ num_channels_latents = self.vae.config.latent_channels
179
+ num_channels_unet = self.unet.config.in_channels
180
+ return_image_latents = num_channels_unet == 4
181
+
182
+ # EDITED HERE
183
+ latents_outputs = self.prepare_latents(
184
+ batch_size * num_images_per_prompt,
185
+ num_channels_latents,
186
+ height,
187
+ width,
188
+ prompt_embeds.dtype,
189
+ device,
190
+ generator,
191
+ latents,
192
+ image=init_image,
193
+ timestep=latent_timestep,
194
+ is_strength_max=is_strength_max,
195
+ return_noise=True,
196
+ return_image_latents=return_image_latents,
197
+ newx=newx,
198
+ newy=newy,
199
+ newr=newr,
200
+ current_seed=current_seed,
201
+ use_noise_moving=use_noise_moving,
202
+ )
203
+
204
+ if return_image_latents:
205
+ latents, noise, image_latents = latents_outputs
206
+ else:
207
+ latents, noise = latents_outputs
208
+
209
+ # 7. Prepare mask latent variables
210
+ mask, masked_image_latents = self.prepare_mask_latents(
211
+ mask,
212
+ masked_image,
213
+ batch_size * num_images_per_prompt,
214
+ height,
215
+ width,
216
+ prompt_embeds.dtype,
217
+ device,
218
+ generator,
219
+ do_classifier_free_guidance,
220
+ )
221
+
222
+ # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
223
+ extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
224
+
225
+ # 7.1 Create tensor stating which controlnets to keep
226
+ controlnet_keep = []
227
+ for i in range(len(timesteps)):
228
+ keeps = [
229
+ 1.0 - float(i / len(timesteps) < s or (i + 1) / len(timesteps) > e)
230
+ for s, e in zip(control_guidance_start, control_guidance_end)
231
+ ]
232
+ controlnet_keep.append(keeps[0] if isinstance(controlnet, ControlNetModel) else keeps)
233
+
234
+ # 8. Denoising loop
235
+ num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
236
+ with self.progress_bar(total=num_inference_steps) as progress_bar:
237
+ for i, t in enumerate(timesteps):
238
+ # expand the latents if we are doing classifier free guidance
239
+ latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
240
+ latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
241
+
242
+ # controlnet(s) inference
243
+ if guess_mode and do_classifier_free_guidance:
244
+ # Infer ControlNet only for the conditional batch.
245
+ control_model_input = latents
246
+ control_model_input = self.scheduler.scale_model_input(control_model_input, t)
247
+ controlnet_prompt_embeds = prompt_embeds.chunk(2)[1]
248
+ else:
249
+ control_model_input = latent_model_input
250
+ controlnet_prompt_embeds = prompt_embeds
251
+
252
+ if isinstance(controlnet_keep[i], list):
253
+ cond_scale = [c * s for c, s in zip(controlnet_conditioning_scale, controlnet_keep[i])]
254
+ else:
255
+ controlnet_cond_scale = controlnet_conditioning_scale
256
+ if isinstance(controlnet_cond_scale, list):
257
+ controlnet_cond_scale = controlnet_cond_scale[0]
258
+ cond_scale = controlnet_cond_scale * controlnet_keep[i]
259
+
260
+ down_block_res_samples, mid_block_res_sample = self.controlnet(
261
+ control_model_input,
262
+ t,
263
+ encoder_hidden_states=controlnet_prompt_embeds,
264
+ controlnet_cond=control_image,
265
+ conditioning_scale=cond_scale,
266
+ guess_mode=guess_mode,
267
+ return_dict=False,
268
+ )
269
+
270
+ if guess_mode and do_classifier_free_guidance:
271
+ # Inferred ControlNet only for the conditional batch.
272
+ # To apply the output of ControlNet to both the unconditional and conditional batches,
273
+ # add 0 to the unconditional batch to keep it unchanged.
274
+ down_block_res_samples = [torch.cat([torch.zeros_like(d), d]) for d in down_block_res_samples]
275
+ mid_block_res_sample = torch.cat([torch.zeros_like(mid_block_res_sample), mid_block_res_sample])
276
+
277
+ # predict the noise residual
278
+ if num_channels_unet == 9:
279
+ latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
280
+
281
+ noise_pred = self.unet(
282
+ latent_model_input,
283
+ t,
284
+ encoder_hidden_states=prompt_embeds,
285
+ cross_attention_kwargs=cross_attention_kwargs,
286
+ down_block_additional_residuals=down_block_res_samples,
287
+ mid_block_additional_residual=mid_block_res_sample,
288
+ return_dict=False,
289
+ )[0]
290
+
291
+ # perform guidance
292
+ if do_classifier_free_guidance:
293
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
294
+ noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
295
+
296
+ # compute the previous noisy sample x_t -> x_t-1
297
+ latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
298
+
299
+ if num_channels_unet == 4:
300
+ init_latents_proper = image_latents[:1]
301
+ init_mask = mask[:1]
302
+
303
+ if i < len(timesteps) - 1:
304
+ noise_timestep = timesteps[i + 1]
305
+ init_latents_proper = self.scheduler.add_noise(
306
+ init_latents_proper, noise, torch.tensor([noise_timestep])
307
+ )
308
+
309
+ latents = (1 - init_mask) * init_latents_proper + init_mask * latents
310
+
311
+ # call the callback, if provided
312
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
313
+ progress_bar.update()
314
+ if callback is not None and i % callback_steps == 0:
315
+ callback(i, t, latents)
316
+
317
+ # If we do sequential model offloading, let's offload unet and controlnet
318
+ # manually for max memory savings
319
+ if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
320
+ self.unet.to("cpu")
321
+ self.controlnet.to("cpu")
322
+ torch.cuda.empty_cache()
323
+
324
+ if not output_type == "latent":
325
+ image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
326
+ image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
327
+ else:
328
+ image = latents
329
+ has_nsfw_concept = None
330
+
331
+ if has_nsfw_concept is None:
332
+ do_denormalize = [True] * image.shape[0]
333
+ else:
334
+ do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
335
+
336
+ image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
337
+
338
+ # Offload all models
339
+ self.maybe_free_model_hooks()
340
+
341
+ if not return_dict:
342
+ return (image, has_nsfw_concept)
343
+
344
+ return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
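A standalone illustration (toy numbers, not tied to the pipeline) of the `controlnet_keep` schedule computed in the denoising loop above: ControlNet guidance is applied only at steps whose normalized position falls inside [control_guidance_start, control_guidance_end]:

```python
# With 10 steps and guidance active for the first half of the schedule:
num_steps = 10
control_guidance_start, control_guidance_end = [0.0], [0.5]

controlnet_keep = []
for i in range(num_steps):
    keeps = [
        1.0 - float(i / num_steps < s or (i + 1) / num_steps > e)
        for s, e in zip(control_guidance_start, control_guidance_end)
    ]
    controlnet_keep.append(keeps[0])

print(controlnet_keep)  # [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```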
relighting/pipeline_inpaintonly.py ADDED
@@ -0,0 +1,613 @@
1
+ import torch
2
+ from typing import List, Union, Dict, Any, Callable, Optional, Tuple
3
+
4
+ from diffusers.image_processor import PipelineImageInput
5
+ from diffusers import StableDiffusionInpaintPipeline, StableDiffusionXLInpaintPipeline
6
+ from diffusers.models import AsymmetricAutoencoderKL
7
+ from diffusers.pipelines.stable_diffusion.pipeline_output import StableDiffusionPipelineOutput
8
+ from diffusers.pipelines.stable_diffusion_xl.pipeline_output import StableDiffusionXLPipelineOutput
9
+ from relighting.pipeline_utils import custom_prepare_latents, custom_prepare_mask_latents, rescale_noise_cfg
10
+
11
+ class CustomStableDiffusionInpaintPipeline(StableDiffusionInpaintPipeline):
12
+ @torch.no_grad()
13
+ def __call__(
14
+ self,
15
+ prompt: Union[str, List[str]] = None,
16
+ image: PipelineImageInput = None,
17
+ mask_image: PipelineImageInput = None,
18
+ masked_image_latents: torch.FloatTensor = None,
19
+ height: Optional[int] = None,
20
+ width: Optional[int] = None,
21
+ strength: float = 1.0,
22
+ num_inference_steps: int = 50,
23
+ guidance_scale: float = 7.5,
24
+ negative_prompt: Optional[Union[str, List[str]]] = None,
25
+ num_images_per_prompt: Optional[int] = 1,
26
+ eta: float = 0.0,
27
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
28
+ latents: Optional[torch.FloatTensor] = None,
29
+ prompt_embeds: Optional[torch.FloatTensor] = None,
30
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
31
+ output_type: Optional[str] = "pil",
32
+ return_dict: bool = True,
33
+ callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
34
+ callback_steps: int = 1,
35
+ cross_attention_kwargs: Optional[Dict[str, Any]] = None,
36
+ newx: int = 0,
37
+ newy: int = 0,
38
+ newr: int = 256,
39
+ current_seed=0,
40
+ use_noise_moving=True,
41
+ ):
42
+ # OVERWRITE METHODS
43
+ self.prepare_mask_latents = custom_prepare_mask_latents.__get__(self, CustomStableDiffusionInpaintPipeline)
44
+ self.prepare_latents = custom_prepare_latents.__get__(self, CustomStableDiffusionInpaintPipeline)
45
+
46
+ # 0. Default height and width to unet
47
+ height = height or self.unet.config.sample_size * self.vae_scale_factor
48
+ width = width or self.unet.config.sample_size * self.vae_scale_factor
49
+
50
+ # 1. Check inputs
51
+ self.check_inputs(
52
+ prompt,
53
+ height,
54
+ width,
55
+ strength,
56
+ callback_steps,
57
+ negative_prompt,
58
+ prompt_embeds,
59
+ negative_prompt_embeds,
60
+ )
61
+
62
+ # 2. Define call parameters
63
+ if prompt is not None and isinstance(prompt, str):
64
+ batch_size = 1
65
+ elif prompt is not None and isinstance(prompt, list):
66
+ batch_size = len(prompt)
67
+ else:
68
+ batch_size = prompt_embeds.shape[0]
69
+
70
+ device = self._execution_device
71
+ # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
72
+ # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
73
+ # corresponds to doing no classifier free guidance.
74
+ do_classifier_free_guidance = guidance_scale > 1.0
75
+
76
+ # 3. Encode input prompt
77
+ text_encoder_lora_scale = (
78
+ cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
79
+ )
80
+ prompt_embeds, negative_prompt_embeds = self.encode_prompt(
81
+ prompt,
82
+ device,
83
+ num_images_per_prompt,
84
+ do_classifier_free_guidance,
85
+ negative_prompt,
86
+ prompt_embeds=prompt_embeds,
87
+ negative_prompt_embeds=negative_prompt_embeds,
88
+ lora_scale=text_encoder_lora_scale,
89
+ )
90
+ # For classifier free guidance, we need to do two forward passes.
91
+ # Here we concatenate the unconditional and text embeddings into a single batch
92
+ # to avoid doing two forward passes
93
+ if do_classifier_free_guidance:
94
+ prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
95
+
96
+ # 4. set timesteps
97
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
98
+ timesteps, num_inference_steps = self.get_timesteps(
99
+ num_inference_steps=num_inference_steps, strength=strength, device=device
100
+ )
101
+ # check that number of inference steps is not < 1 - as this doesn't make sense
102
+ if num_inference_steps < 1:
103
+ raise ValueError(
104
+ f"After adjusting the num_inference_steps by strength parameter: {strength}, the number of pipeline"
105
+ f"steps is {num_inference_steps} which is < 1 and not appropriate for this pipeline."
106
+ )
107
+ # at which timestep to set the initial noise (n.b. 50% if strength is 0.5)
108
+ latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
109
+ # create a boolean to check if the strength is set to 1. if so then initialise the latents with pure noise
110
+ is_strength_max = strength == 1.0
111
+
112
+ # 5. Preprocess mask and image
113
+
114
+ init_image = self.image_processor.preprocess(image, height=height, width=width)
115
+ init_image = init_image.to(dtype=torch.float32)
116
+
117
+ # 6. Prepare latent variables
118
+ num_channels_latents = self.vae.config.latent_channels
119
+ num_channels_unet = self.unet.config.in_channels
120
+ return_image_latents = num_channels_unet == 4
121
+
122
+ latents_outputs = self.prepare_latents(
123
+ batch_size * num_images_per_prompt,
124
+ num_channels_latents,
125
+ height,
126
+ width,
127
+ prompt_embeds.dtype,
128
+ device,
129
+ generator,
130
+ latents,
131
+ image=init_image,
132
+ timestep=latent_timestep,
133
+ is_strength_max=is_strength_max,
134
+ return_noise=True,
135
+ return_image_latents=return_image_latents,
136
+ newx=newx,
137
+ newy=newy,
138
+ newr=newr,
139
+ current_seed=current_seed,
140
+ use_noise_moving=use_noise_moving,
141
+ )
142
+
143
+ if return_image_latents:
144
+ latents, noise, image_latents = latents_outputs
145
+ else:
146
+ latents, noise = latents_outputs
147
+
148
+ # 7. Prepare mask latent variables
149
+ mask_condition = self.mask_processor.preprocess(mask_image, height=height, width=width)
150
+
151
+ if masked_image_latents is None:
152
+ masked_image = init_image * (mask_condition < 0.5)
153
+ else:
154
+ masked_image = masked_image_latents
155
+
156
+ mask, masked_image_latents = self.prepare_mask_latents(
157
+ mask_condition,
158
+ masked_image,
159
+ batch_size * num_images_per_prompt,
160
+ height,
161
+ width,
162
+ prompt_embeds.dtype,
163
+ device,
164
+ generator,
165
+ do_classifier_free_guidance,
166
+ )
167
+
168
+ # 8. Check that sizes of mask, masked image and latents match
169
+ if num_channels_unet == 9:
170
+ # default case for runwayml/stable-diffusion-inpainting
171
+ num_channels_mask = mask.shape[1]
172
+ num_channels_masked_image = masked_image_latents.shape[1]
173
+ if num_channels_latents + num_channels_mask + num_channels_masked_image != self.unet.config.in_channels:
174
+ raise ValueError(
175
+ f"Incorrect configuration settings! The config of `pipeline.unet`: {self.unet.config} expects"
176
+ f" {self.unet.config.in_channels} but received `num_channels_latents`: {num_channels_latents} +"
177
+ f" `num_channels_mask`: {num_channels_mask} + `num_channels_masked_image`: {num_channels_masked_image}"
178
+ f" = {num_channels_latents+num_channels_masked_image+num_channels_mask}. Please verify the config of"
179
+ " `pipeline.unet` or your `mask_image` or `image` input."
180
+ )
181
+ elif num_channels_unet != 4:
182
+ raise ValueError(
183
+ f"The unet {self.unet.__class__} should have either 4 or 9 input channels, not {self.unet.config.in_channels}."
184
+ )
185
+
186
+ # 9. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
187
+ extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
188
+
189
+ # 10. Denoising loop
190
+ num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
191
+ with self.progress_bar(total=num_inference_steps) as progress_bar:
192
+ for i, t in enumerate(timesteps):
193
+ # expand the latents if we are doing classifier free guidance
194
+ latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
195
+
196
+ # concat latents, mask, masked_image_latents in the channel dimension
197
+ latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
198
+
199
+ if num_channels_unet == 9:
200
+ latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
201
+
202
+ # predict the noise residual
203
+ noise_pred = self.unet(
204
+ latent_model_input,
205
+ t,
206
+ encoder_hidden_states=prompt_embeds,
207
+ cross_attention_kwargs=cross_attention_kwargs,
208
+ return_dict=False,
209
+ )[0]
210
+
211
+ # perform guidance
212
+ if do_classifier_free_guidance:
213
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
214
+ noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
215
+
216
+ # compute the previous noisy sample x_t -> x_t-1
217
+ latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
218
+
219
+ if num_channels_unet == 4:
220
+ init_latents_proper = image_latents[:1]
221
+ init_mask = mask[:1]
222
+
223
+ if i < len(timesteps) - 1:
224
+ noise_timestep = timesteps[i + 1]
225
+ init_latents_proper = self.scheduler.add_noise(
226
+ init_latents_proper, noise, torch.tensor([noise_timestep])
227
+ )
228
+
229
+ latents = (1 - init_mask) * init_latents_proper + init_mask * latents
230
+
231
+ # call the callback, if provided
232
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
233
+ progress_bar.update()
234
+ if callback is not None and i % callback_steps == 0:
235
+ callback(i, t, latents)
236
+
237
+ if not output_type == "latent":
238
+ condition_kwargs = {}
239
+ if isinstance(self.vae, AsymmetricAutoencoderKL):
240
+ init_image = init_image.to(device=device, dtype=masked_image_latents.dtype)
241
+ init_image_condition = init_image.clone()
242
+ init_image = self._encode_vae_image(init_image, generator=generator)
243
+ mask_condition = mask_condition.to(device=device, dtype=masked_image_latents.dtype)
244
+ condition_kwargs = {"image": init_image_condition, "mask": mask_condition}
245
+ image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, **condition_kwargs)[0]
246
+ image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
247
+ else:
248
+ image = latents
249
+ has_nsfw_concept = None
250
+
251
+ if has_nsfw_concept is None:
252
+ do_denormalize = [True] * image.shape[0]
253
+ else:
254
+ do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
255
+
256
+ image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
257
+
258
+ # Offload all models
259
+ self.maybe_free_model_hooks()
260
+
261
+ if not return_dict:
262
+ return (image, has_nsfw_concept)
263
+
264
+ return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
265
+
266
+ class CustomStableDiffusionXLInpaintPipeline(StableDiffusionXLInpaintPipeline):
267
+ @torch.no_grad()
268
+ def __call__(
269
+ self,
270
+ prompt: Union[str, List[str]] = None,
271
+ prompt_2: Optional[Union[str, List[str]]] = None,
272
+ image: PipelineImageInput = None,
273
+ mask_image: PipelineImageInput = None,
274
+ masked_image_latents: torch.FloatTensor = None,
275
+ height: Optional[int] = None,
276
+ width: Optional[int] = None,
277
+ strength: float = 0.9999,
278
+ num_inference_steps: int = 50,
279
+ denoising_start: Optional[float] = None,
280
+ denoising_end: Optional[float] = None,
281
+ guidance_scale: float = 7.5,
282
+ negative_prompt: Optional[Union[str, List[str]]] = None,
283
+ negative_prompt_2: Optional[Union[str, List[str]]] = None,
284
+ num_images_per_prompt: Optional[int] = 1,
285
+ eta: float = 0.0,
286
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
287
+ latents: Optional[torch.FloatTensor] = None,
288
+ prompt_embeds: Optional[torch.FloatTensor] = None,
289
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
290
+ pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
291
+ negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
292
+ output_type: Optional[str] = "pil",
293
+ return_dict: bool = True,
294
+ callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
295
+ callback_steps: int = 1,
296
+ cross_attention_kwargs: Optional[Dict[str, Any]] = None,
297
+ guidance_rescale: float = 0.0,
298
+ original_size: Tuple[int, int] = None,
299
+ crops_coords_top_left: Tuple[int, int] = (0, 0),
300
+ target_size: Tuple[int, int] = None,
301
+ negative_original_size: Optional[Tuple[int, int]] = None,
302
+ negative_crops_coords_top_left: Tuple[int, int] = (0, 0),
303
+ negative_target_size: Optional[Tuple[int, int]] = None,
304
+ aesthetic_score: float = 6.0,
305
+ negative_aesthetic_score: float = 2.5,
306
+ newx: int = 0,
307
+ newy: int = 0,
308
+ newr: int = 256,
309
+ current_seed=0,
310
+ use_noise_moving=True,
311
+ ):
312
+ # OVERWRITE METHODS
313
+ self.prepare_mask_latents = custom_prepare_mask_latents.__get__(self, CustomStableDiffusionXLInpaintPipeline)
314
+ self.prepare_latents = custom_prepare_latents.__get__(self, CustomStableDiffusionXLInpaintPipeline)
315
+
316
+ # 0. Default height and width to unet
317
+ height = height or self.unet.config.sample_size * self.vae_scale_factor
318
+ width = width or self.unet.config.sample_size * self.vae_scale_factor
319
+
320
+ # 1. Check inputs
321
+ self.check_inputs(
322
+ prompt,
323
+ prompt_2,
324
+ height,
325
+ width,
326
+ strength,
327
+ callback_steps,
328
+ negative_prompt,
329
+ negative_prompt_2,
330
+ prompt_embeds,
331
+ negative_prompt_embeds,
332
+ )
333
+
334
+ # 2. Define call parameters
335
+ if prompt is not None and isinstance(prompt, str):
336
+ batch_size = 1
337
+ elif prompt is not None and isinstance(prompt, list):
338
+ batch_size = len(prompt)
339
+ else:
340
+ batch_size = prompt_embeds.shape[0]
341
+
342
+ device = self._execution_device
343
+ # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
344
+ # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
345
+ # corresponds to doing no classifier free guidance.
346
+ do_classifier_free_guidance = guidance_scale > 1.0
347
+
348
+ # 3. Encode input prompt
349
+ text_encoder_lora_scale = (
350
+ cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
351
+ )
352
+
353
+ (
354
+ prompt_embeds,
355
+ negative_prompt_embeds,
356
+ pooled_prompt_embeds,
357
+ negative_pooled_prompt_embeds,
358
+ ) = self.encode_prompt(
359
+ prompt=prompt,
360
+ prompt_2=prompt_2,
361
+ device=device,
362
+ num_images_per_prompt=num_images_per_prompt,
363
+ do_classifier_free_guidance=do_classifier_free_guidance,
364
+ negative_prompt=negative_prompt,
365
+ negative_prompt_2=negative_prompt_2,
366
+ prompt_embeds=prompt_embeds,
367
+ negative_prompt_embeds=negative_prompt_embeds,
368
+ pooled_prompt_embeds=pooled_prompt_embeds,
369
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
370
+ lora_scale=text_encoder_lora_scale,
371
+ )
372
+
373
+ # 4. set timesteps
374
+ def denoising_value_valid(dnv):
375
+ return isinstance(dnv, float) and 0 < dnv < 1
376
+
377
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
378
+ timesteps, num_inference_steps = self.get_timesteps(
379
+ num_inference_steps, strength, device, denoising_start=denoising_start if denoising_value_valid(denoising_start) else None
380
+ )
381
+ # check that number of inference steps is not < 1 - as this doesn't make sense
382
+ if num_inference_steps < 1:
383
+ raise ValueError(
384
+ f"After adjusting the num_inference_steps by strength parameter: {strength}, the number of pipeline "
385
+ f"steps is {num_inference_steps} which is < 1 and not appropriate for this pipeline."
386
+ )
387
+ # at which timestep to set the initial noise (n.b. 50% if strength is 0.5)
388
+ latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
389
+ # create a boolean to check if the strength is set to 1. if so then initialise the latents with pure noise
390
+ is_strength_max = strength == 1.0
391
+
392
+ # 5. Preprocess mask and image
393
+ init_image = self.image_processor.preprocess(image, height=height, width=width)
394
+ init_image = init_image.to(dtype=torch.float32)
395
+
396
+ mask = self.mask_processor.preprocess(mask_image, height=height, width=width)
397
+
398
+ if masked_image_latents is not None:
399
+ masked_image = masked_image_latents
400
+ elif init_image.shape[1] == 4:
401
+ # if images are in latent space, we can't mask it
402
+ masked_image = None
403
+ else:
404
+ masked_image = init_image * (mask < 0.5)
405
+
406
+ # 6. Prepare latent variables
407
+ num_channels_latents = self.vae.config.latent_channels
408
+ num_channels_unet = self.unet.config.in_channels
409
+ return_image_latents = num_channels_unet == 4
410
+
411
+ # add_noise = True if denoising_start is None else False
412
+ latents_outputs = self.prepare_latents(
413
+ batch_size * num_images_per_prompt,
414
+ num_channels_latents,
415
+ height,
416
+ width,
417
+ prompt_embeds.dtype,
418
+ device,
419
+ generator,
420
+ latents,
421
+ image=init_image,
422
+ timestep=latent_timestep,
423
+ is_strength_max=is_strength_max,
424
+ return_noise=True,
425
+ return_image_latents=return_image_latents,
426
+ newx=newx,
427
+ newy=newy,
428
+ newr=newr,
429
+ current_seed=current_seed,
430
+ use_noise_moving=use_noise_moving,
431
+ )
432
+
433
+ if return_image_latents:
434
+ latents, noise, image_latents = latents_outputs
435
+ else:
436
+ latents, noise = latents_outputs
437
+
438
+ # 7. Prepare mask latent variables
439
+ mask, masked_image_latents = self.prepare_mask_latents(
440
+ mask,
441
+ masked_image,
442
+ batch_size * num_images_per_prompt,
443
+ height,
444
+ width,
445
+ prompt_embeds.dtype,
446
+ device,
447
+ generator,
448
+ do_classifier_free_guidance,
449
+ )
450
+
451
+ # 8. Check that sizes of mask, masked image and latents match
452
+ if num_channels_unet == 9:
453
+ # default case for runwayml/stable-diffusion-inpainting
454
+ num_channels_mask = mask.shape[1]
455
+ num_channels_masked_image = masked_image_latents.shape[1]
456
+ if num_channels_latents + num_channels_mask + num_channels_masked_image != self.unet.config.in_channels:
457
+ raise ValueError(
458
+ f"Incorrect configuration settings! The config of `pipeline.unet`: {self.unet.config} expects"
459
+ f" {self.unet.config.in_channels} but received `num_channels_latents`: {num_channels_latents} +"
460
+ f" `num_channels_mask`: {num_channels_mask} + `num_channels_masked_image`: {num_channels_masked_image}"
461
+ f" = {num_channels_latents+num_channels_masked_image+num_channels_mask}. Please verify the config of"
462
+ " `pipeline.unet` or your `mask_image` or `image` input."
463
+ )
464
+ elif num_channels_unet != 4:
465
+ raise ValueError(
466
+ f"The unet {self.unet.__class__} should have either 4 or 9 input channels, not {self.unet.config.in_channels}."
467
+ )
468
+ # 8.1 Prepare extra step kwargs.
469
+ extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
470
+
471
+ # 9. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
472
+ height, width = latents.shape[-2:]
473
+ height = height * self.vae_scale_factor
474
+ width = width * self.vae_scale_factor
475
+
476
+ original_size = original_size or (height, width)
477
+ target_size = target_size or (height, width)
478
+
479
+ # 10. Prepare added time ids & embeddings
480
+ if negative_original_size is None:
481
+ negative_original_size = original_size
482
+ if negative_target_size is None:
483
+ negative_target_size = target_size
484
+
485
+ add_text_embeds = pooled_prompt_embeds
486
+ add_time_ids, add_neg_time_ids = self._get_add_time_ids(
487
+ original_size,
488
+ crops_coords_top_left,
489
+ target_size,
490
+ aesthetic_score,
491
+ negative_aesthetic_score,
492
+ negative_original_size,
493
+ negative_crops_coords_top_left,
494
+ negative_target_size,
495
+ dtype=prompt_embeds.dtype,
496
+ )
497
+ add_time_ids = add_time_ids.repeat(batch_size * num_images_per_prompt, 1)
498
+
499
+ if do_classifier_free_guidance:
500
+ prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
501
+ add_text_embeds = torch.cat([negative_pooled_prompt_embeds, add_text_embeds], dim=0)
502
+ add_neg_time_ids = add_neg_time_ids.repeat(batch_size * num_images_per_prompt, 1)
503
+ add_time_ids = torch.cat([add_neg_time_ids, add_time_ids], dim=0)
504
+
505
+ prompt_embeds = prompt_embeds.to(device)
506
+ add_text_embeds = add_text_embeds.to(device)
507
+ add_time_ids = add_time_ids.to(device)
508
+
509
+ # 11. Denoising loop
510
+ num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
511
+
512
+ if (
513
+ denoising_end is not None
514
+ and denoising_start is not None
515
+ and denoising_value_valid(denoising_end)
516
+ and denoising_value_valid(denoising_start)
517
+ and denoising_start >= denoising_end
518
+ ):
519
+ raise ValueError(
520
+ f"`denoising_start`: {denoising_start} cannot be larger than or equal to `denoising_end`: "
521
+ + f" {denoising_end} when using type float."
522
+ )
523
+ elif denoising_end is not None and denoising_value_valid(denoising_end):
524
+ discrete_timestep_cutoff = int(
525
+ round(
526
+ self.scheduler.config.num_train_timesteps
527
+ - (denoising_end * self.scheduler.config.num_train_timesteps)
528
+ )
529
+ )
530
+ num_inference_steps = len(list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps)))
531
+ timesteps = timesteps[:num_inference_steps]
532
+
533
+ with self.progress_bar(total=num_inference_steps) as progress_bar:
534
+ for i, t in enumerate(timesteps):
535
+ # expand the latents if we are doing classifier free guidance
536
+ latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
537
+
538
+ # concat latents, mask, masked_image_latents in the channel dimension
539
+ latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
540
+
541
+ if num_channels_unet == 9:
542
+ latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
543
+
544
+ # predict the noise residual
545
+ added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}
546
+ noise_pred = self.unet(
547
+ latent_model_input,
548
+ t,
549
+ encoder_hidden_states=prompt_embeds,
550
+ cross_attention_kwargs=cross_attention_kwargs,
551
+ added_cond_kwargs=added_cond_kwargs,
552
+ return_dict=False,
553
+ )[0]
554
+
555
+ # perform guidance
556
+ if do_classifier_free_guidance:
557
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
558
+ noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
559
+
560
+ if do_classifier_free_guidance and guidance_rescale > 0.0:
561
+ # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
562
+ noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale)
563
+
564
+ # compute the previous noisy sample x_t -> x_t-1
565
+ latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
566
+
567
+ if num_channels_unet == 4:
568
+ init_latents_proper = image_latents[:1]
569
+ init_mask = mask[:1]
570
+
571
+ if i < len(timesteps) - 1:
572
+ noise_timestep = timesteps[i + 1]
573
+ init_latents_proper = self.scheduler.add_noise(
574
+ init_latents_proper, noise, torch.tensor([noise_timestep])
575
+ )
576
+
577
+ latents = (1 - init_mask) * init_latents_proper + init_mask * latents
578
+
579
+ # call the callback, if provided
580
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
581
+ progress_bar.update()
582
+ if callback is not None and i % callback_steps == 0:
583
+ callback(i, t, latents)
584
+
585
+ if not output_type == "latent":
586
+ # make sure the VAE is in float32 mode, as it overflows in float16
587
+ needs_upcasting = self.vae.dtype == torch.float16 and self.vae.config.force_upcast
588
+
589
+ if needs_upcasting:
590
+ self.upcast_vae()
591
+ latents = latents.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)
592
+
593
+ image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
594
+
595
+ # cast back to fp16 if needed
596
+ if needs_upcasting:
597
+ self.vae.to(dtype=torch.float16)
598
+ else:
599
+ return StableDiffusionXLPipelineOutput(images=latents)
600
+
601
+ # apply watermark if available
602
+ if self.watermark is not None:
603
+ image = self.watermark.apply_watermark(image)
604
+
605
+ image = self.image_processor.postprocess(image, output_type=output_type)
606
+
607
+ # Offload all models
608
+ self.maybe_free_model_hooks()
609
+
610
+ if not return_dict:
611
+ return (image,)
612
+
613
+ return StableDiffusionXLPipelineOutput(images=image)
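The inpainting pipeline above keeps the stock SDXL denoising loop and differs mainly in the extra placement arguments (`newx`, `newy`, `newr`, `current_seed`, `use_noise_moving`) that are forwarded to the overridden `prepare_latents`. A minimal usage sketch follows; the checkpoint name, module path, prompt, and file names are illustrative assumptions, not part of this commit.

```python
# Hedged usage sketch for CustomStableDiffusionXLInpaintPipeline (paths and checkpoint are assumptions).
import torch
from PIL import Image
# module path is assumed; import the class from wherever this repo defines it
from relighting.pipeline_inpaintonly import CustomStableDiffusionXLInpaintPipeline

pipe = CustomStableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",   # assumed base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("input.png").convert("RGB").resize((1024, 1024))    # assumed input image
mask = Image.open("ball_mask.png").convert("L").resize((1024, 1024))   # assumed ball mask

out = pipe(
    prompt="a perfect mirrored reflective chrome ball sphere",          # illustrative prompt
    image=image,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=5.0,
    newx=384, newy=384, newr=256,   # ball top-left corner and size in pixels, as read by custom_prepare_latents
    current_seed=0,                 # seed forwarded to the noise-expansion helper
    use_noise_moving=True,          # keep the ball's noise patch fixed regardless of placement
)
out.images[0].save("ball.png")
```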
relighting/pipeline_utils.py ADDED
@@ -0,0 +1,185 @@
1
+ import torch
2
+ import numpy as np
3
+ import itertools
4
+ from diffusers.utils.torch_utils import randn_tensor
5
+
6
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.rescale_noise_cfg
7
+ def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
8
+ """
9
+ Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of [Common Diffusion Noise Schedules and
10
+ Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4
11
+ """
12
+ std_text = noise_pred_text.std(dim=list(range(1, noise_pred_text.ndim)), keepdim=True)
13
+ std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True)
14
+ # rescale the results from guidance (fixes overexposure)
15
+ noise_pred_rescaled = noise_cfg * (std_text / std_cfg)
16
+ # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images
17
+ noise_cfg = guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg
18
+ return noise_cfg
19
+
20
+ def expand_noise(noise, shape, seed, device, dtype):
21
+ new_generator = torch.Generator().manual_seed(seed)
22
+ corner_shape = (shape[0], shape[1], shape[2] // 2, shape[3] // 2)
23
+ vert_border_shape = (shape[0], shape[1], shape[2], shape[3] // 2)
24
+ hori_border_shape = (shape[0], shape[1], shape[2] // 2, shape[3])
25
+
26
+ corners = [randn_tensor(corner_shape, generator=new_generator, device=device, dtype=dtype) for _ in range(4)]
27
+ vert_borders = [randn_tensor(vert_border_shape, generator=new_generator, device=device, dtype=dtype) for _ in range(2)]
28
+ hori_borders = [randn_tensor(hori_border_shape, generator=new_generator, device=device, dtype=dtype) for _ in range(2)]
29
+
30
+ # combine
31
+ big_shape = (shape[0], shape[1], shape[2] * 2, shape[3] * 2)
32
+ noise_template = randn_tensor(big_shape, generator=new_generator, device=device, dtype=dtype)
33
+
34
+ ticks = [(0, 0.25), (0.25, 0.75), (0.75, 1.0)]
35
+ grid = list(itertools.product(ticks, ticks))
36
+ noise_list = [
37
+ corners[0], hori_borders[0], corners[1],
38
+ vert_borders[0], noise, vert_borders[1],
39
+ corners[2], hori_borders[1], corners[3],
40
+ ]
41
+ for current_noise, ((x1, x2), (y1, y2)) in zip(noise_list, grid):
42
+ top_left = (int(x1 * big_shape[2]), int(y1 * big_shape[3]))
43
+ bottom_right = (int(x2 * big_shape[2]), int(y2 * big_shape[3]))
44
+ noise_template[:, :, top_left[0]:bottom_right[0], top_left[1]:bottom_right[1]] = current_noise
45
+
46
+ return noise_template
47
+
48
+ def custom_prepare_latents(
49
+ self,
50
+ batch_size,
51
+ num_channels_latents,
52
+ height,
53
+ width,
54
+ dtype,
55
+ device,
56
+ generator,
57
+ latents=None,
58
+ image=None,
59
+ timestep=None,
60
+ is_strength_max=True,
61
+ use_noise_moving=True,
62
+ return_noise=False,
63
+ return_image_latents=False,
64
+ newx=0,
65
+ newy=0,
66
+ newr=256,
67
+ current_seed=None,
68
+ ):
69
+ shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
70
+ if isinstance(generator, list) and len(generator) != batch_size:
71
+ raise ValueError(
72
+ f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
73
+ f" size of {batch_size}. Make sure the batch size matches the length of the generators."
74
+ )
75
+
76
+ if (image is None or timestep is None) and not is_strength_max:
77
+ raise ValueError(
78
+ "Since strength < 1, the initial latents need to be initialised as a combination of image + noise. "
79
+ "However, either the image or the noise timestep has not been provided."
80
+ )
81
+
82
+ if image.shape[1] == 4:
83
+ image_latents = image.to(device=device, dtype=dtype)
84
+ elif return_image_latents or (latents is None and not is_strength_max):
85
+ image = image.to(device=device, dtype=dtype)
86
+ image_latents = self._encode_vae_image(image=image, generator=generator)
87
+
88
+ if latents is None and use_noise_moving:
89
+ # random big noise map
90
+ noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
91
+ noise = expand_noise(noise, shape, seed=current_seed, device=device, dtype=dtype)
92
+
93
+ # ensure noise is the same regardless of inpainting location (top-left corner notation)
94
+ newys = [newy] if not isinstance(newy, list) else newy
95
+ newxs = [newx] if not isinstance(newx, list) else newx
96
+ big_noise = noise.clone()
97
+ prev_noise = None
98
+ for newy, newx in zip(newys, newxs):
99
+ # find patch location within big noise map
100
+ sy = big_noise.shape[2] // 4 + ((512 - 128) - newy) // self.vae_scale_factor
101
+ sx = big_noise.shape[3] // 4 + ((512 - 128) - newx) // self.vae_scale_factor
102
+
103
+ if prev_noise is not None:
104
+ new_noise = big_noise[:, :, sy:sy+shape[2], sx:sx+shape[3]]
105
+
106
+ ball_mask = torch.zeros(shape, device=device, dtype=bool)
107
+ top_left = (newy // self.vae_scale_factor, newx // self.vae_scale_factor)
108
+ bottom_right = (top_left[0] + newr // self.vae_scale_factor, top_left[1] + newr // self.vae_scale_factor) # fixed ball size r = 256
109
+ ball_mask[:, :, top_left[0]:bottom_right[0], top_left[1]:bottom_right[1]] = True
110
+
111
+ noise = prev_noise.clone()
112
+ noise[ball_mask] = new_noise[ball_mask]
113
+ else:
114
+ noise = big_noise[:, :, sy:sy+shape[2], sx:sx+shape[3]]
115
+
116
+ prev_noise = noise.clone()
117
+
118
+ # if strength is 1. then initialise the latents to noise, else initial to image + noise
119
+ latents = noise if is_strength_max else self.scheduler.add_noise(image_latents, noise, timestep)
120
+ # if pure noise then scale the initial latents by the Scheduler's init sigma
121
+ latents = latents * self.scheduler.init_noise_sigma if is_strength_max else latents
122
+ elif latents is None:
123
+ noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
124
+ latents = image_latents.to(device)
125
+ else:
126
+ noise = latents.to(device)
127
+ latents = noise * self.scheduler.init_noise_sigma
128
+
129
+ outputs = (latents,)
130
+
131
+ if return_noise:
132
+ outputs += (noise,)
133
+
134
+ if return_image_latents:
135
+ outputs += (image_latents,)
136
+
137
+ return outputs
138
+
139
+ def custom_prepare_mask_latents(
140
+ self, mask, masked_image, batch_size, height, width, dtype, device, generator, do_classifier_free_guidance
141
+ ):
142
+ # resize the mask to latents shape as we concatenate the mask to the latents
143
+ # we do that before converting to dtype to avoid breaking in case we're using cpu_offload
144
+ # and half precision
145
+ mask = torch.nn.functional.interpolate(
146
+ mask, size=(height // self.vae_scale_factor, width // self.vae_scale_factor),
147
+ mode="bilinear", align_corners=False #PURE: We add this to avoid a sharp border around the ball
148
+ )
149
+ mask = mask.to(device=device, dtype=dtype)
150
+
151
+ # duplicate mask and masked_image_latents for each generation per prompt, using mps friendly method
152
+ if mask.shape[0] < batch_size:
153
+ if not batch_size % mask.shape[0] == 0:
154
+ raise ValueError(
155
+ "The passed mask and the required batch size don't match. Masks are supposed to be duplicated to"
156
+ f" a total batch size of {batch_size}, but {mask.shape[0]} masks were passed. Make sure the number"
157
+ " of masks that you pass is divisible by the total requested batch size."
158
+ )
159
+ mask = mask.repeat(batch_size // mask.shape[0], 1, 1, 1)
160
+
161
+ mask = torch.cat([mask] * 2) if do_classifier_free_guidance else mask
162
+
163
+ masked_image_latents = None
164
+ if masked_image is not None:
165
+ masked_image = masked_image.to(device=device, dtype=dtype)
166
+ masked_image_latents = self._encode_vae_image(masked_image, generator=generator)
167
+ if masked_image_latents.shape[0] < batch_size:
168
+ if not batch_size % masked_image_latents.shape[0] == 0:
169
+ raise ValueError(
170
+ "The passed images and the required batch size don't match. Images are supposed to be duplicated"
171
+ f" to a total batch size of {batch_size}, but {masked_image_latents.shape[0]} images were passed."
172
+ " Make sure the number of images that you pass is divisible by the total requested batch size."
173
+ )
174
+ masked_image_latents = masked_image_latents.repeat(
175
+ batch_size // masked_image_latents.shape[0], 1, 1, 1
176
+ )
177
+
178
+ masked_image_latents = (
179
+ torch.cat([masked_image_latents] * 2) if do_classifier_free_guidance else masked_image_latents
180
+ )
181
+
182
+ # aligning device to prevent device errors when concating it with the latent model input
183
+ masked_image_latents = masked_image_latents.to(device=device, dtype=dtype)
184
+
185
+ return mask, masked_image_latents
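The key trick in `expand_noise` is that the original noise map occupies the central 50% of a canvas twice as large, with freshly seeded noise filling the surrounding borders and corners; `custom_prepare_latents` then crops a window out of this canvas based on `newx`/`newy`, so the noise under the ball stays identical no matter where the ball is placed. A small sanity-check sketch of that layout (shapes are illustrative):

```python
# Sanity check of expand_noise's 3x3 tiling: the input noise sits in the centre of the doubled canvas.
import torch
from relighting.pipeline_utils import expand_noise

shape = (1, 4, 128, 128)                    # e.g. SDXL latent shape for a 1024x1024 image
noise = torch.randn(shape)
big = expand_noise(noise, shape, seed=0, device=torch.device("cpu"), dtype=torch.float32)

assert big.shape == (1, 4, 256, 256)        # canvas is 2x larger in both spatial dims
h, w = shape[2], shape[3]
centre = big[:, :, h // 2 : h // 2 + h, w // 2 : w // 2 + w]
assert torch.equal(centre, noise)           # ticks (0.25, 0.75) place the original noise in the middle
```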
relighting/pipeline_xl.py ADDED
@@ -0,0 +1,482 @@
1
+ import torch
2
+ from typing import List, Union, Dict, Any, Callable, Optional, Tuple
3
+
4
+ from diffusers.utils.torch_utils import is_compiled_module
5
+ from diffusers.models import ControlNetModel
6
+ from diffusers.pipelines.controlnet import MultiControlNetModel
7
+ from diffusers import StableDiffusionXLControlNetInpaintPipeline
8
+ from diffusers.image_processor import PipelineImageInput
9
+ from diffusers.pipelines.stable_diffusion_xl.pipeline_output import StableDiffusionXLPipelineOutput
10
+ from relighting.pipeline_utils import custom_prepare_latents, custom_prepare_mask_latents, rescale_noise_cfg
11
+
12
+ class CustomStableDiffusionXLControlNetInpaintPipeline(StableDiffusionXLControlNetInpaintPipeline):
13
+ @torch.no_grad()
14
+ def __call__(
15
+ self,
16
+ prompt: Union[str, List[str]] = None,
17
+ prompt_2: Optional[Union[str, List[str]]] = None,
18
+ image: PipelineImageInput = None,
19
+ mask_image: PipelineImageInput = None,
20
+ control_image: Union[
21
+ PipelineImageInput,
22
+ List[PipelineImageInput],
23
+ ] = None,
24
+ height: Optional[int] = None,
25
+ width: Optional[int] = None,
26
+ strength: float = 0.9999,
27
+ num_inference_steps: int = 50,
28
+ denoising_start: Optional[float] = None,
29
+ denoising_end: Optional[float] = None,
30
+ guidance_scale: float = 5.0,
31
+ negative_prompt: Optional[Union[str, List[str]]] = None,
32
+ negative_prompt_2: Optional[Union[str, List[str]]] = None,
33
+ num_images_per_prompt: Optional[int] = 1,
34
+ eta: float = 0.0,
35
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
36
+ latents: Optional[torch.FloatTensor] = None,
37
+ prompt_embeds: Optional[torch.FloatTensor] = None,
38
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
39
+ pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
40
+ negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
41
+ output_type: Optional[str] = "pil",
42
+ return_dict: bool = True,
43
+ callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
44
+ callback_steps: int = 1,
45
+ cross_attention_kwargs: Optional[Dict[str, Any]] = None,
46
+ controlnet_conditioning_scale: Union[float, List[float]] = 1.0,
47
+ guess_mode: bool = False,
48
+ control_guidance_start: Union[float, List[float]] = 0.0,
49
+ control_guidance_end: Union[float, List[float]] = 1.0,
50
+ guidance_rescale: float = 0.0,
51
+ original_size: Tuple[int, int] = None,
52
+ crops_coords_top_left: Tuple[int, int] = (0, 0),
53
+ target_size: Tuple[int, int] = None,
54
+ aesthetic_score: float = 6.0,
55
+ negative_aesthetic_score: float = 2.5,
56
+ newx: int = 0,
57
+ newy: int = 0,
58
+ newr: int = 256,
59
+ current_seed=0,
60
+ use_noise_moving=True,
61
+ ):
62
+ # OVERWRITE METHODS
63
+ self.prepare_mask_latents = custom_prepare_mask_latents.__get__(self, CustomStableDiffusionXLControlNetInpaintPipeline)
64
+ self.prepare_latents = custom_prepare_latents.__get__(self, CustomStableDiffusionXLControlNetInpaintPipeline)
65
+
66
+ controlnet = self.controlnet._orig_mod if is_compiled_module(self.controlnet) else self.controlnet
67
+
68
+ # align format for control guidance
69
+ if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
70
+ control_guidance_start = len(control_guidance_end) * [control_guidance_start]
71
+ elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
72
+ control_guidance_end = len(control_guidance_start) * [control_guidance_end]
73
+ elif not isinstance(control_guidance_start, list) and not isinstance(control_guidance_end, list):
74
+ mult = len(controlnet.nets) if isinstance(controlnet, MultiControlNetModel) else 1
75
+ control_guidance_start, control_guidance_end = mult * [control_guidance_start], mult * [
76
+ control_guidance_end
77
+ ]
78
+
79
+ # # 0.0 Default height and width to unet
80
+ # height = height or self.unet.config.sample_size * self.vae_scale_factor
81
+ # width = width or self.unet.config.sample_size * self.vae_scale_factor
82
+
83
+ # 0.1 align format for control guidance
84
+ if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
85
+ control_guidance_start = len(control_guidance_end) * [control_guidance_start]
86
+ elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
87
+ control_guidance_end = len(control_guidance_start) * [control_guidance_end]
88
+ elif not isinstance(control_guidance_start, list) and not isinstance(control_guidance_end, list):
89
+ mult = len(controlnet.nets) if isinstance(controlnet, MultiControlNetModel) else 1
90
+ control_guidance_start, control_guidance_end = mult * [control_guidance_start], mult * [
91
+ control_guidance_end
92
+ ]
93
+
94
+ # 1. Check inputs
95
+ self.check_inputs(
96
+ prompt,
97
+ prompt_2,
98
+ control_image,
99
+ strength,
100
+ num_inference_steps,
101
+ callback_steps,
102
+ negative_prompt,
103
+ negative_prompt_2,
104
+ prompt_embeds,
105
+ negative_prompt_embeds,
106
+ pooled_prompt_embeds,
107
+ negative_pooled_prompt_embeds,
108
+ controlnet_conditioning_scale,
109
+ control_guidance_start,
110
+ control_guidance_end,
111
+ )
112
+
113
+ # 2. Define call parameters
114
+ if prompt is not None and isinstance(prompt, str):
115
+ batch_size = 1
116
+ elif prompt is not None and isinstance(prompt, list):
117
+ batch_size = len(prompt)
118
+ else:
119
+ batch_size = prompt_embeds.shape[0]
120
+
121
+ device = self._execution_device
122
+ # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
123
+ # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
124
+ # corresponds to doing no classifier free guidance.
125
+ do_classifier_free_guidance = guidance_scale > 1.0
126
+
127
+ if isinstance(controlnet, MultiControlNetModel) and isinstance(controlnet_conditioning_scale, float):
128
+ controlnet_conditioning_scale = [controlnet_conditioning_scale] * len(controlnet.nets)
129
+
130
+ # 3. Encode input prompt
131
+ text_encoder_lora_scale = (
132
+ cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
133
+ )
134
+
135
+ (
136
+ prompt_embeds,
137
+ negative_prompt_embeds,
138
+ pooled_prompt_embeds,
139
+ negative_pooled_prompt_embeds,
140
+ ) = self.encode_prompt(
141
+ prompt=prompt,
142
+ prompt_2=prompt_2,
143
+ device=device,
144
+ num_images_per_prompt=num_images_per_prompt,
145
+ do_classifier_free_guidance=do_classifier_free_guidance,
146
+ negative_prompt=negative_prompt,
147
+ negative_prompt_2=negative_prompt_2,
148
+ prompt_embeds=prompt_embeds,
149
+ negative_prompt_embeds=negative_prompt_embeds,
150
+ pooled_prompt_embeds=pooled_prompt_embeds,
151
+ negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
152
+ lora_scale=text_encoder_lora_scale,
153
+ )
154
+
155
+ # 4. set timesteps
156
+ def denoising_value_valid(dnv):
157
+ return isinstance(dnv, float) and 0 < dnv < 1
158
+
159
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
160
+ timesteps, num_inference_steps = self.get_timesteps(
161
+ num_inference_steps, strength, device, denoising_start=denoising_start if denoising_value_valid(denoising_start) else None
162
+ )
163
+ # check that number of inference steps is not < 1 - as this doesn't make sense
164
+ if num_inference_steps < 1:
165
+ raise ValueError(
166
+ f"After adjusting the num_inference_steps by strength parameter: {strength}, the number of pipeline "
167
+ f"steps is {num_inference_steps} which is < 1 and not appropriate for this pipeline."
168
+ )
169
+ # at which timestep to set the initial noise (n.b. 50% if strength is 0.5)
170
+ latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
171
+ # create a boolean to check if the strength is set to 1. if so then initialise the latents with pure noise
172
+ is_strength_max = strength == 1.0
173
+
174
+ # 5. Preprocess mask and image - resizes image and mask w.r.t height and width
175
+ # 5.1 Prepare init image
176
+ init_image = self.image_processor.preprocess(image, height=height, width=width)
177
+ init_image = init_image.to(dtype=torch.float32)
178
+
179
+ # 5.2 Prepare control images
180
+ if isinstance(controlnet, ControlNetModel):
181
+ control_image = self.prepare_control_image(
182
+ image=control_image,
183
+ width=width,
184
+ height=height,
185
+ batch_size=batch_size * num_images_per_prompt,
186
+ num_images_per_prompt=num_images_per_prompt,
187
+ device=device,
188
+ dtype=controlnet.dtype,
189
+ do_classifier_free_guidance=do_classifier_free_guidance,
190
+ guess_mode=guess_mode,
191
+ )
192
+ elif isinstance(controlnet, MultiControlNetModel):
193
+ control_images = []
194
+
195
+ for control_image_ in control_image:
196
+ control_image_ = self.prepare_control_image(
197
+ image=control_image_,
198
+ width=width,
199
+ height=height,
200
+ batch_size=batch_size * num_images_per_prompt,
201
+ num_images_per_prompt=num_images_per_prompt,
202
+ device=device,
203
+ dtype=controlnet.dtype,
204
+ do_classifier_free_guidance=do_classifier_free_guidance,
205
+ guess_mode=guess_mode,
206
+ )
207
+
208
+ control_images.append(control_image_)
209
+
210
+ control_image = control_images
211
+ else:
212
+ raise ValueError(f"{controlnet.__class__} is not supported.")
213
+
214
+ # 5.3 Prepare mask
215
+ mask = self.mask_processor.preprocess(mask_image, height=height, width=width)
216
+
217
+ masked_image = init_image * (mask < 0.5)
218
+ _, _, height, width = init_image.shape
219
+
220
+ # 6. Prepare latent variables
221
+ num_channels_latents = self.vae.config.latent_channels
222
+ num_channels_unet = self.unet.config.in_channels
223
+ return_image_latents = num_channels_unet == 4
224
+
225
+ add_noise = True if denoising_start is None else False
226
+ latents_outputs = self.prepare_latents(
227
+ batch_size * num_images_per_prompt,
228
+ num_channels_latents,
229
+ height,
230
+ width,
231
+ prompt_embeds.dtype,
232
+ device,
233
+ generator,
234
+ latents,
235
+ image=init_image,
236
+ timestep=latent_timestep,
237
+ is_strength_max=is_strength_max,
238
+ return_noise=True,
239
+ return_image_latents=return_image_latents,
240
+ newx=newx,
241
+ newy=newy,
242
+ newr=newr,
243
+ current_seed=current_seed,
244
+ use_noise_moving=use_noise_moving,
245
+ )
246
+
247
+ if return_image_latents:
248
+ latents, noise, image_latents = latents_outputs
249
+ else:
250
+ latents, noise = latents_outputs
251
+
252
+ # 7. Prepare mask latent variables
253
+ mask, masked_image_latents = self.prepare_mask_latents(
254
+ mask,
255
+ masked_image,
256
+ batch_size * num_images_per_prompt,
257
+ height,
258
+ width,
259
+ prompt_embeds.dtype,
260
+ device,
261
+ generator,
262
+ do_classifier_free_guidance,
263
+ )
264
+
265
+ # 8. Check that sizes of mask, masked image and latents match
266
+ if num_channels_unet == 9:
267
+ # default case for runwayml/stable-diffusion-inpainting
268
+ num_channels_mask = mask.shape[1]
269
+ num_channels_masked_image = masked_image_latents.shape[1]
270
+ if num_channels_latents + num_channels_mask + num_channels_masked_image != self.unet.config.in_channels:
271
+ raise ValueError(
272
+ f"Incorrect configuration settings! The config of `pipeline.unet`: {self.unet.config} expects"
273
+ f" {self.unet.config.in_channels} but received `num_channels_latents`: {num_channels_latents} +"
274
+ f" `num_channels_mask`: {num_channels_mask} + `num_channels_masked_image`: {num_channels_masked_image}"
275
+ f" = {num_channels_latents+num_channels_masked_image+num_channels_mask}. Please verify the config of"
276
+ " `pipeline.unet` or your `mask_image` or `image` input."
277
+ )
278
+ elif num_channels_unet != 4:
279
+ raise ValueError(
280
+ f"The unet {self.unet.__class__} should have either 4 or 9 input channels, not {self.unet.config.in_channels}."
281
+ )
282
+ # 8.1 Prepare extra step kwargs.
283
+ extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
284
+
285
+ # 8.2 Create tensor stating which controlnets to keep
286
+ controlnet_keep = []
287
+ for i in range(len(timesteps)):
288
+ keeps = [
289
+ 1.0 - float(i / len(timesteps) < s or (i + 1) / len(timesteps) > e)
290
+ for s, e in zip(control_guidance_start, control_guidance_end)
291
+ ]
292
+ if isinstance(self.controlnet, MultiControlNetModel):
293
+ controlnet_keep.append(keeps)
294
+ else:
295
+ controlnet_keep.append(keeps[0])
296
+
297
+ # 9. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
298
+ height, width = latents.shape[-2:]
299
+ height = height * self.vae_scale_factor
300
+ width = width * self.vae_scale_factor
301
+
302
+ original_size = original_size or (height, width)
303
+ target_size = target_size or (height, width)
304
+
305
+ # 10. Prepare added time ids & embeddings
306
+ add_text_embeds = pooled_prompt_embeds
307
+ add_time_ids, add_neg_time_ids = self._get_add_time_ids(
308
+ original_size,
309
+ crops_coords_top_left,
310
+ target_size,
311
+ aesthetic_score,
312
+ negative_aesthetic_score,
313
+ dtype=prompt_embeds.dtype,
314
+ )
315
+ add_time_ids = add_time_ids.repeat(batch_size * num_images_per_prompt, 1)
316
+
317
+ if do_classifier_free_guidance:
318
+ prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
319
+ add_text_embeds = torch.cat([negative_pooled_prompt_embeds, add_text_embeds], dim=0)
320
+ add_neg_time_ids = add_neg_time_ids.repeat(batch_size * num_images_per_prompt, 1)
321
+ add_time_ids = torch.cat([add_neg_time_ids, add_time_ids], dim=0)
322
+
323
+ prompt_embeds = prompt_embeds.to(device)
324
+ add_text_embeds = add_text_embeds.to(device)
325
+ add_time_ids = add_time_ids.to(device)
326
+
327
+ # 11. Denoising loop
328
+ num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
329
+
330
+ if (
331
+ denoising_end is not None
332
+ and denoising_start is not None
333
+ and denoising_value_valid(denoising_end)
334
+ and denoising_value_valid(denoising_start)
335
+ and denoising_start >= denoising_end
336
+ ):
337
+ raise ValueError(
338
+ f"`denoising_start`: {denoising_start} cannot be larger than or equal to `denoising_end`: "
339
+ + f" {denoising_end} when using type float."
340
+ )
341
+ elif denoising_end is not None and denoising_value_valid(denoising_end):
342
+ discrete_timestep_cutoff = int(
343
+ round(
344
+ self.scheduler.config.num_train_timesteps
345
+ - (denoising_end * self.scheduler.config.num_train_timesteps)
346
+ )
347
+ )
348
+ num_inference_steps = len(list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps)))
349
+ timesteps = timesteps[:num_inference_steps]
350
+
351
+ with self.progress_bar(total=num_inference_steps) as progress_bar:
352
+ for i, t in enumerate(timesteps):
353
+ # expand the latents if we are doing classifier free guidance
354
+ latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
355
+
356
+ # concat latents, mask, masked_image_latents in the channel dimension
357
+ latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
358
+
359
+ added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}
360
+
361
+ # controlnet(s) inference
362
+ if guess_mode and do_classifier_free_guidance:
363
+ # Infer ControlNet only for the conditional batch.
364
+ control_model_input = latents
365
+ control_model_input = self.scheduler.scale_model_input(control_model_input, t)
366
+ controlnet_prompt_embeds = prompt_embeds.chunk(2)[1]
367
+ controlnet_added_cond_kwargs = {
368
+ "text_embeds": add_text_embeds.chunk(2)[1],
369
+ "time_ids": add_time_ids.chunk(2)[1],
370
+ }
371
+ else:
372
+ control_model_input = latent_model_input
373
+ controlnet_prompt_embeds = prompt_embeds
374
+ controlnet_added_cond_kwargs = added_cond_kwargs
375
+
376
+ if isinstance(controlnet_keep[i], list):
377
+ cond_scale = [c * s for c, s in zip(controlnet_conditioning_scale, controlnet_keep[i])]
378
+ else:
379
+ controlnet_cond_scale = controlnet_conditioning_scale
380
+ if isinstance(controlnet_cond_scale, list):
381
+ controlnet_cond_scale = controlnet_cond_scale[0]
382
+ cond_scale = controlnet_cond_scale * controlnet_keep[i]
383
+
384
+ # # Resize control_image to match the size of the input to the controlnet
385
+ # if control_image.shape[-2:] != control_model_input.shape[-2:]:
386
+ # control_image = F.interpolate(control_image, size=control_model_input.shape[-2:], mode="bilinear", align_corners=False)
387
+
388
+ down_block_res_samples, mid_block_res_sample = self.controlnet(
389
+ control_model_input,
390
+ t,
391
+ encoder_hidden_states=controlnet_prompt_embeds,
392
+ controlnet_cond=control_image,
393
+ conditioning_scale=cond_scale,
394
+ guess_mode=guess_mode,
395
+ added_cond_kwargs=controlnet_added_cond_kwargs,
396
+ return_dict=False,
397
+ )
398
+
399
+ if guess_mode and do_classifier_free_guidance:
400
+ # Inferred ControlNet only for the conditional batch.
401
+ # To apply the output of ControlNet to both the unconditional and conditional batches,
402
+ # add 0 to the unconditional batch to keep it unchanged.
403
+ down_block_res_samples = [torch.cat([torch.zeros_like(d), d]) for d in down_block_res_samples]
404
+ mid_block_res_sample = torch.cat([torch.zeros_like(mid_block_res_sample), mid_block_res_sample])
405
+
406
+ if num_channels_unet == 9:
407
+ latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
408
+
409
+ # predict the noise residual
410
+ noise_pred = self.unet(
411
+ latent_model_input,
412
+ t,
413
+ encoder_hidden_states=prompt_embeds,
414
+ cross_attention_kwargs=cross_attention_kwargs,
415
+ down_block_additional_residuals=down_block_res_samples,
416
+ mid_block_additional_residual=mid_block_res_sample,
417
+ added_cond_kwargs=added_cond_kwargs,
418
+ return_dict=False,
419
+ )[0]
420
+
421
+ # perform guidance
422
+ if do_classifier_free_guidance:
423
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
424
+ noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
425
+
426
+ if do_classifier_free_guidance and guidance_rescale > 0.0:
427
+ print("rescale: ", guidance_rescale)
428
+ # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
429
+ noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale)
430
+
431
+ # compute the previous noisy sample x_t -> x_t-1
432
+ latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
433
+
434
+ if num_channels_unet == 4:
435
+ init_latents_proper = image_latents[:1]
436
+ init_mask = mask[:1]
437
+
438
+ if i < len(timesteps) - 1:
439
+ noise_timestep = timesteps[i + 1]
440
+ init_latents_proper = self.scheduler.add_noise(
441
+ init_latents_proper, noise, torch.tensor([noise_timestep])
442
+ )
443
+
444
+ latents = (1 - init_mask) * init_latents_proper + init_mask * latents
445
+
446
+ # call the callback, if provided
447
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
448
+ progress_bar.update()
449
+ if callback is not None and i % callback_steps == 0:
450
+ callback(i, t, latents)
451
+
452
+ # make sure the VAE is in float32 mode, as it overflows in float16
453
+ if self.vae.dtype == torch.float16 and self.vae.config.force_upcast:
454
+ self.upcast_vae()
455
+ latents = latents.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)
456
+
457
+ # If we do sequential model offloading, let's offload unet and controlnet
458
+ # manually for max memory savings
459
+ if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
460
+ self.unet.to("cpu")
461
+ self.controlnet.to("cpu")
462
+ torch.cuda.empty_cache()
463
+
464
+ if not output_type == "latent":
465
+ image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
466
+ else:
467
+ return StableDiffusionXLPipelineOutput(images=latents)
468
+
469
+ # apply watermark if available
470
+ if self.watermark is not None:
471
+ image = self.watermark.apply_watermark(image)
472
+
473
+ image = self.image_processor.postprocess(image, output_type=output_type)
474
+
475
+ # Offload last model to CPU
476
+ if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
477
+ self.final_offload_hook.offload()
478
+
479
+ if not return_dict:
480
+ return (image,)
481
+
482
+ return StableDiffusionXLPipelineOutput(images=image)
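Both custom pipelines swap in the helpers from `relighting/pipeline_utils.py` at call time via `custom_prepare_latents.__get__(self, <PipelineClass>)`. This is plain Python descriptor behaviour rather than anything diffusers-specific: calling `__get__` on a function binds it to an instance so the result can be assigned over an existing method of that one object. A generic illustration (all names here are made up for the example):

```python
# Generic illustration of the __get__ method-override trick used in the pipelines above.
class Greeter:
    name = "world"

    def greet(self):
        return f"hello {self.name}"

def loud_greet(self):
    # free function written like a method; `self` becomes the instance it is bound to
    return f"HELLO {self.name.upper()}!"

g = Greeter()
# functions are descriptors: func.__get__(instance, owner) returns a bound method,
# which can then be assigned over the existing method on that single instance
g.greet = loud_greet.__get__(g, Greeter)
print(g.greet())  # -> "HELLO WORLD!"
```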
relighting/tonemapper.py ADDED
@@ -0,0 +1,33 @@
1
+ import numpy as np
2
+
3
+ class TonemapHDR(object):
4
+ """
5
+ Tonemap an HDR image globally. First, we find alpha such that the chosen `percentile` of the
6
+ gamma-compressed input maps to max_mapping. Then, we compute I_out = alpha * I_in ^ (1/gamma)
7
+ input : np.ndarray image : [H, W, C]
8
+ output : np.ndarray image : [H, W, C]
9
+ """
10
+
11
+ def __init__(self, gamma=2.4, percentile=50, max_mapping=0.5):
12
+ self.gamma = gamma
13
+ self.percentile = percentile
14
+ self.max_mapping = max_mapping # the value that alpha maps the chosen percentile of the input to
15
+
16
+ def __call__(self, numpy_img, clip=True, alpha=None, gamma=True):
17
+ if gamma:
18
+ power_numpy_img = np.power(numpy_img, 1 / self.gamma)
19
+ else:
20
+ power_numpy_img = numpy_img
21
+ non_zero = power_numpy_img > 0
22
+ if non_zero.any():
23
+ r_percentile = np.percentile(power_numpy_img[non_zero], self.percentile)
24
+ else:
25
+ r_percentile = np.percentile(power_numpy_img, self.percentile)
26
+ if alpha is None:
27
+ alpha = self.max_mapping / (r_percentile + 1e-10)
28
+ tonemapped_img = np.multiply(alpha, power_numpy_img)
29
+
30
+ if clip:
31
+ tonemapped_img_clip = np.clip(tonemapped_img, 0, 1)
+ else:
+ tonemapped_img_clip = tonemapped_img # avoid returning an unbound name when clip=False
32
+
33
+ return tonemapped_img_clip.astype('float32'), alpha, tonemapped_img
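In short, `TonemapHDR` gamma-compresses the input, picks `alpha` so the chosen percentile of the compressed image lands at `max_mapping`, and returns both the clipped LDR image and the `alpha` it used, so the same scaling can be reused on related images. A hedged usage sketch with synthetic data (real inputs would typically be HDR arrays loaded from .exr files):

```python
# Usage sketch for TonemapHDR on synthetic HDR data (values and shapes are illustrative).
import numpy as np
from relighting.tonemapper import TonemapHDR

hdr = np.random.rand(256, 256, 3).astype(np.float32) * 8.0   # fake HDR radiance
tonemap = TonemapHDR(gamma=2.4, percentile=50, max_mapping=0.5)

ldr, alpha, unclipped = tonemap(hdr)   # ldr is clipped to [0, 1]; alpha is the scale that was applied
print(ldr.dtype, float(ldr.max()), alpha)

# Reuse the same alpha to tonemap a second image consistently with the first
hdr2 = np.random.rand(256, 256, 3).astype(np.float32) * 8.0
ldr2, _, _ = tonemap(hdr2, alpha=alpha)
```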
relighting/utils.py ADDED
@@ -0,0 +1,56 @@
1
+ import argparse
2
+ import os
3
+ from pathlib import Path
4
+ from PIL import Image
5
+ import hashlib
6
+
7
+ def str2bool(v):
8
+ """
9
+ https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse
10
+ """
11
+ if isinstance(v, bool):
12
+ return v
13
+ if v.lower() in ("yes", "true", "t", "y", "1"):
14
+ return True
15
+ elif v.lower() in ("no", "false", "f", "n", "0"):
16
+ return False
17
+ else:
18
+ raise argparse.ArgumentTypeError("boolean value expected")
19
+
20
+ def add_dict_to_argparser(parser, default_dict):
21
+ for k, v in default_dict.items():
22
+ v_type = type(v)
23
+ if v is None:
24
+ v_type = str
25
+ elif isinstance(v, bool):
26
+ v_type = str2bool
27
+ parser.add_argument(f"--{k}", default=v, type=v_type)
28
+
29
+ def args_to_dict(args, keys):
30
+ return {k: getattr(args, k) for k in keys}
31
+
32
+ def save_result(
33
+ image, image_path,
34
+ mask=None, mask_path=None,
35
+ normal=None, normal_path=None,
36
+ ):
37
+ assert isinstance(image, Image.Image)
38
+ os.makedirs(Path(image_path).parent, exist_ok=True)
39
+ image.save(image_path)
40
+
41
+ if (mask is not None) and (mask_path is not None):
42
+ assert isinstance(mask, Image.Image)
43
+ os.makedirs(Path(mask_path).parent, exist_ok=True)
44
+ mask.save(mask_path)
45
+
46
+ if (normal is not None) and (normal_path is not None):
47
+ assert isinstance(normal, Image.Image)
48
+ os.makedirs(Path(normal_path).parent, exist_ok=True)
49
+ normal.save(normal_path)
50
+
51
+ def name2hash(name: str):
52
+ """
53
+ @see https://stackoverflow.com/questions/16008670/how-to-hash-a-string-into-8-digits
54
+ """
55
+ hash_number = int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16) % (10 ** 8)
56
+ return hash_number
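`add_dict_to_argparser` turns a dict of defaults into CLI flags, inferring each flag's type from its default (`str2bool` for booleans, `str` for `None`), and `args_to_dict` collects the parsed values back into a dict. A small sketch of how these helpers fit together (the defaults shown are illustrative, not the repo's actual configuration):

```python
# Sketch wiring add_dict_to_argparser / args_to_dict together (defaults are made up for the example).
import argparse
from relighting.utils import add_dict_to_argparser, args_to_dict

defaults = {"img_width": 1024, "img_height": 1024, "use_controlnet": True, "output_dir": None}

parser = argparse.ArgumentParser()
add_dict_to_argparser(parser, defaults)

args = parser.parse_args(["--use_controlnet", "false", "--output_dir", "output/"])
print(args_to_dict(args, defaults.keys()))
# {'img_width': 1024, 'img_height': 1024, 'use_controlnet': False, 'output_dir': 'output/'}
```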
requirements.txt ADDED
@@ -0,0 +1,25 @@
1
+ # utility
2
+ tqdm==4.66.1
3
+ scikit-image==0.21.0
4
+ imageio==2.31.1
5
+ Pillow==10.2.0
6
+ numpy==1.24.1
7
+ natsort==8.4.0
8
+
9
+ # EXR handling
10
+ skylibs==0.7.4
11
+ OpenEXR==1.3.9
12
+
13
+ # We install PyTorch from pip rather than conda because the conda setup is messy
14
+ --extra-index-url https://download.pytorch.org/whl/cu118
15
+ torch==2.0.1+cu118
16
+ torchvision==0.15.2+cu118
17
+ torchaudio==2.0.2+cu118
18
+
19
+ # Diffusers dependencies
20
+ accelerate==0.21.0
21
+ datasets==2.13.1
22
+ diffusers==0.21.0
23
+ transformers==4.36.0
24
+ xformers==0.0.20
25
+ huggingface_hub==0.19.4
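Because the custom pipelines above copy and override diffusers internals (`prepare_latents`, `prepare_mask_latents`, the SDXL `__call__` bodies), they are sensitive to the exact pinned versions. An optional runtime check (not part of this commit) could look like:

```python
# Optional sanity check that the installed versions match the pins above.
import diffusers
import transformers

assert diffusers.__version__ == "0.21.0", diffusers.__version__
assert transformers.__version__ == "4.36.0", transformers.__version__
```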