Tags: pytorch, huggingface, stable-diffusion

Multiple threads of the Hugging Face Stable Diffusion inpainting pipeline slow down inference on the same GPU


I am using the Stable Diffusion inpainting pipeline to generate inference results on an A100 (40 GB) GPU. For a 512×512 image it takes approximately 3 s per image and uses about 5 GB of GPU memory.
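
A minimal version of this setup with diffusers looks roughly like the sketch below (the checkpoint name, file paths, and prompt are illustrative placeholders, not part of the question):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load the inpainting pipeline in half precision on the GPU.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # placeholder checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder input and mask; white areas of the mask are inpainted.
init_image = Image.open("input.png").convert("RGB").resize((512, 512))
mask_image = Image.open("mask.png").convert("L").resize((512, 512))

result = pipe(
    prompt="a red sofa in a living room",  # placeholder prompt
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("output.png")
```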

To get faster inference, I am trying to run 2 threads (2 inference scripts). However, as soon as I start them simultaneously, the inference time increases to ~6 s per thread, for an effective time of ~3 s per image, i.e. no faster than a single thread.

I am unable to understand why this happens. I still have plenty of free GPU memory (about 35 GB) and a fairly large 32 GB of CPU RAM.

Can someone help me in this regard?


Solution

  • Regardless of the VRAM headroom, if the Stable Diffusion model is already using most of the SMs (streaming multiprocessors) on the GPU, there is no spare hardware left to run the inference of two images in parallel on the same GPU; the two threads simply time-share the compute, so each one takes about twice as long. You can usually confirm this with nvidia-smi, which will report GPU utilization near 100% even while a single inference is running.
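
If the goal is higher throughput on a single GPU, a common alternative to threading is to batch several requests into one pipeline call, so each denoising step runs over a larger batch instead of two processes competing for SMs. A minimal sketch, reusing the `pipe`, `init_image`, and `mask_image` names from the setup above (the prompts are placeholders):

```python
# Batch two inpainting requests in a single call; the pipeline runs the
# UNet over a batch of 2 latents at each denoising step, which usually
# gives better throughput than two threads time-sharing the GPU.
prompts = [
    "a red sofa in a living room",
    "a blue armchair in a living room",
]
results = pipe(
    prompt=prompts,
    image=[init_image] * len(prompts),
    mask_image=[mask_image] * len(prompts),
).images  # list with one output image per prompt
```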