I am using this sample Python code from Hugging Face:
from diffusers import DiffusionPipeline
import torch
# load both base & refiner
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
base.to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")
# Define the number of steps and the fraction of steps to run on each expert (80/20) here
n_steps = 40
high_noise_frac = 0.8
prompt = "A majestic lion jumping from a big stone at night"
# run both experts
image = base(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=image,
).images[0]
image
But every time the image is about to be generated, the Google Colab session crashes. Has anyone successfully run the Base + Refiner in Google Colab? I have successfully run the Base on its own, but I want to use both together.
Seems I can't reply to myself so here's a new Answer.
Yes, it can be done. I've done a fair bit of work on the code I used above and now have a version that runs well within the free Colab limits: it uses 8.9 GB of VRAM, 5 GB of RAM and 38 GB of storage, so that should stop the crashing.
The main improvements were using the 16-bit VAE, which saved a lot of VRAM on decoding, and not re-using the text_encoder from the base pipe, which seems to free up another 3 GB of VRAM.
I did try CPU offloading so that both models could be in memory at once, even if split across CPU and GPU memory, but that just used up all the system RAM, so I settled for doing the base run for all images followed by the refiner run for all images.
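For anyone who just wants the shape of the approach without opening the notebook, here is a minimal sketch of the "base run for all images, then refiner run for all images" idea. It is not the Gist code. The helper names (free_gpu_memory, load_base, load_refiner, generate) are made up for illustration, and the fp16 VAE checkpoint "madebyollin/sdxl-vae-fp16-fix" is an assumption about which 16-bit VAE to use. The loaders import torch and diffusers lazily, so only one pipeline needs to be resident on the GPU at a time.

```python
import gc

N_STEPS = 40
HIGH_NOISE_FRAC = 0.8


def free_gpu_memory():
    """Collect garbage and, if torch with CUDA is present, release cached VRAM."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass


def load_base():
    """Hypothetical loader: base pipeline only, in fp16."""
    import torch
    from diffusers import DiffusionPipeline
    return DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")


def load_refiner():
    """Hypothetical loader: refiner with its own text encoder and an fp16 VAE."""
    import torch
    from diffusers import AutoencoderKL, DiffusionPipeline
    # Assumption: this community checkpoint stands in for "the 16-bit VAE".
    vae = AutoencoderKL.from_pretrained(
        "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
    )
    return DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        vae=vae, torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")


def generate(prompts, load_base=load_base, load_refiner=load_refiner):
    """Run the base on every prompt first, then the refiner on every latent."""
    # Phase 1: only the base is loaded; keep the latents, skip VAE decoding.
    base = load_base()
    latents = [
        base(prompt=p, num_inference_steps=N_STEPS,
             denoising_end=HIGH_NOISE_FRAC, output_type="latent").images
        for p in prompts
    ]
    del base
    free_gpu_memory()

    # Phase 2: only the refiner is loaded; it finishes denoising and decodes.
    refiner = load_refiner()
    images = [
        refiner(prompt=p, num_inference_steps=N_STEPS,
                denoising_start=HIGH_NOISE_FRAC, image=lat).images[0]
        for p, lat in zip(prompts, latents)
    ]
    del refiner
    free_gpu_memory()
    return images
```

On a Colab GPU runtime you would call something like generate(["A majestic lion jumping from a big stone at night"]) and get back a list of PIL images. The trade-off is that each call reloads both models, so it pays to pass several prompts per call.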
I've stuck it in a Gist:
https://gist.github.com/Vargol/ae56f6c1bd825523d028a5925b4b1dad
It can also generate more than one image at a time and uses the same styles as the various bots and Clipdrop; the "No Style" style is called Enhance.
As before, the generation part (the last cell) is re-runnable, so it doesn't go through the setup over and over if you want to generate more images.