I have a server which applies filters (implemented as OpenGL shaders) to images. The filters are mostly direct colour mappings, but occasionally blurs and other convolutions.
The source images are PNGs and JPGs in a variety of sizes, from around 100x100 pixels up to 16,384x16,384 (the maximum texture size for my GPU).
The pipeline is:
Decode image to RGBA (CPU)
|
V
Load texture to GPU
|
V
Apply shader (GPU)
|
V
Unload to CPU memory
|
V
Encode to PNG (CPU)
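In case it helps, here is roughly what the GPU part of that pipeline looks like in my code. This is a simplified sketch, not the real thing: it assumes a current GL context, GLEW as the loader, and that the texture, FBO and filter program are created once at startup; the function and variable names are just illustrative.

```cpp
// Sketch of the per-image GPU path.  Assumes a current OpenGL context, and
// that `tex`, `fbo` (with a colour attachment of the right size) and
// `program` (the filter shader) already exist.  The vertex shader is assumed
// to emit a full-screen triangle from gl_VertexID, with an empty VAO bound.
#include <GL/glew.h>
#include <cstdint>
#include <vector>

void processOneImage(GLuint tex, GLuint fbo, GLuint program,
                     const std::vector<uint8_t>& rgba, int w, int h,
                     std::vector<uint8_t>& outRgba)
{
    // 1. Load texture to GPU (~0.75 ms on my card)
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, rgba.data());

    // 2. Apply shader (~1.5 ms): draw a full-screen triangle into the FBO
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glViewport(0, 0, w, h);
    glUseProgram(program);
    glDrawArrays(GL_TRIANGLES, 0, 3);

    // 3. Unload to CPU memory (~1.5 ms) -- blocks until the draw has finished
    outRgba.resize(size_t(w) * h * 4);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, outRgba.data());
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
```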
The mean GPU timings are approximately 0.75 ms to load, 1.5 ms to unload, and 1.5 ms to process a texture.
I have multiple CPU threads decoding PNGs and JPGs to provide a continuous stream of work to the GPU.
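In case the threading side matters, it is essentially a bounded producer/consumer queue between the decode workers and the single thread that owns the GL context. A stripped-down sketch of that arrangement (DecodedImage is a stand-in for my real decode output; the decode workers call push(), the GL thread calls pop() in a loop):

```cpp
// Simplified sketch of the decode-side threading: worker threads decode
// PNG/JPG files into RGBA buffers and push them onto a bounded queue; the
// single thread that owns the GL context pops from it and runs the GPU
// pipeline shown above.
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct DecodedImage { int w = 0, h = 0; std::vector<uint8_t> rgba; };

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(DecodedImage img) {                 // called by decode workers
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push_back(std::move(img));
        notEmpty_.notify_one();
    }

    DecodedImage pop() {                          // called by the GL thread
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        DecodedImage img = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return img;
    }

private:
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<DecodedImage> q_;
};
```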
The challenge is that `watch -n 0.1 nvidia-smi` reports that GPU utilisation is mostly 0-1%, with periodic spikes to 18%. I really want to get more value out of the GPU, i.e. I'd like to see its load at least around 50%. My questions:
1. Is `nvidia-smi` giving a reasonable representation of how busy the GPU is? Does it, for example, include the time to load and unload textures? If not, is there a better metric I could be using?
2. Assuming that it is, and the GPU really is sitting back doing nothing, are there any well-understood architectures for increasing throughput? I've considered tiling multiple images into a large texture, but this feels like it will drive up CPU usage rather than GPU usage.
3. Is there some way I could be loading the next image into GPU texture memory while the GPU is processing the previous image? (A rough sketch of what I have in mind is below.)
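For question 3, the sort of overlap I have in mind would, I think, use pixel buffer objects. The snippet below is only a sketch of the idea, not something I have working: fill one PBO with image N+1 while the GPU is still sampling the texture that was filled from the other PBO.

```cpp
// Sketch of double-buffered uploads with pixel buffer objects (PBOs).
// With a GL_PIXEL_UNPACK_BUFFER bound, glTexSubImage2D queues an async DMA
// from the PBO instead of copying from client memory, so the transfer of
// image N can overlap the shader work on image N-1.  The same idea applies
// to readback with GL_PIXEL_PACK_BUFFER.
#include <GL/glew.h>
#include <cstdint>
#include <cstring>
#include <vector>

GLuint pbo[2];

void createUploadPBOs(size_t maxBytes) {
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, maxBytes, nullptr, GL_STREAM_DRAW);
    }
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

// Called once per image, alternating `which` between 0 and 1.
void uploadViaPBO(int which, GLuint tex, const std::vector<uint8_t>& rgba,
                  int w, int h) {
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[which]);

    // Orphan the buffer, then copy the decoded pixels into driver-owned memory.
    glBufferData(GL_PIXEL_UNPACK_BUFFER, rgba.size(), nullptr, GL_STREAM_DRAW);
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    std::memcpy(dst, rgba.data(), rgba.size());
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // With an unpack PBO bound, the last argument is an offset into the PBO,
    // and this call returns without waiting for the transfer to complete.
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```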
Sampling `nvidia-smi` is a really poor way of figuring out utilization. Use Nvidia Visual Profiler (I find this easiest to work with) or Nvidia Nsight to get a true picture of what your performance and bottlenecks are.
It's hard to say how to improve performance without seeing your code and without you having a better understanding of what the bottleneck is.
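If a full profiler run is not convenient, one lightweight option is OpenGL timer queries around each stage. This is only a rough sketch (it assumes GL 3.3+ and that the stage being timed is whatever you issue between the begin/end calls; the function-pointer wrapper is just for illustration), but it gives you real per-stage GPU time independent of what `nvidia-smi` happens to sample:

```cpp
// Rough sketch of timing one stage with a GL_TIME_ELAPSED query (GL 3.3+).
// Measures actual GPU execution time of the commands issued between
// glBeginQuery and glEndQuery.
#include <GL/glew.h>

double timeStageMs(void (*stage)())
{
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    stage();                     // e.g. the upload, the draw, or the readback
    glEndQuery(GL_TIME_ELAPSED);

    // Blocks until the result is ready -- fine for profiling, not production.
    GLuint64 nanoseconds = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &nanoseconds);
    glDeleteQueries(1, &query);
    return nanoseconds / 1.0e6;
}
```

If the sum of those per-stage GPU times is much smaller than the wall-clock time between images, the GPU really is idle and the bottleneck is on the CPU or transfer side.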