I have a server which applies filters (implemented as OpenGL shaders) to images. The filters are mostly direct colour mappings, but occasionally blurs and other convolutions.
The source images are PNGs and JPGs in a variety of sizes, from around 100x100 pixels up to 16,384x16,384 (the maximum texture size for my GPU).
The pipeline is:
Decode image to RGBA (CPU)
|
V
Load texture to GPU
|
V
Apply shader (GPU)
|
V
Unload to CPU memory
|
V
Encode to PNG (CPU)
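In case it helps, here is roughly what the GPU part of that pipeline looks like in my code. This is a simplified sketch, not the real thing: it assumes a current GL context, GLEW as the loader, and that the texture, FBO and filter program are created once at startup; the function and variable names are just illustrative.

```cpp
// Sketch of the per-image GPU path.  Assumes a current OpenGL context, and
// that `tex`, `fbo` (with a colour attachment of the right size) and
// `program` (the filter shader) already exist.  The vertex shader is assumed
// to emit a full-screen triangle from gl_VertexID, with an empty VAO bound.
#include <GL/glew.h>
#include <cstdint>
#include <vector>

void processOneImage(GLuint tex, GLuint fbo, GLuint program,
                     const std::vector<uint8_t>& rgba, int w, int h,
                     std::vector<uint8_t>& outRgba)
{
    // 1. Load texture to GPU (~0.75 ms on my card)
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, rgba.data());

    // 2. Apply shader (~1.5 ms): draw a full-screen triangle into the FBO
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glViewport(0, 0, w, h);
    glUseProgram(program);
    glDrawArrays(GL_TRIANGLES, 0, 3);

    // 3. Unload to CPU memory (~1.5 ms) -- blocks until the draw has finished
    outRgba.resize(size_t(w) * h * 4);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, outRgba.data());
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
```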
The mean GPU timings are approximately 0.75 ms to load, 1.5 ms to unload, and 1.5 ms to process a texture.
I have multiple CPU threads decoding PNGs and JPGs to provide a continuous stream of work to the GPU.
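In case the threading side matters, it is essentially a bounded producer/consumer queue between the decode workers and the single thread that owns the GL context. A stripped-down sketch of that arrangement (DecodedImage is a stand-in for my real decode output; the decode workers call push(), the GL thread calls pop() in a loop):

```cpp
// Simplified sketch of the decode-side threading: worker threads decode
// PNG/JPG files into RGBA buffers and push them onto a bounded queue; the
// single thread that owns the GL context pops from it and runs the GPU
// pipeline shown above.
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct DecodedImage { int w = 0, h = 0; std::vector<uint8_t> rgba; };

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(DecodedImage img) {                 // called by decode workers
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push_back(std::move(img));
        notEmpty_.notify_one();
    }

    DecodedImage pop() {                          // called by the GL thread
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        DecodedImage img = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return img;
    }

private:
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::deque<DecodedImage> q_;
};
```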
The challenge is that `watch -n 0.1 nvidia-smi` reports that GPU utilisation is mostly 0-1%, with periodic spikes to 18%. I really want to get more value out of the GPU, i.e. I'd like to see its load at least around 50%. My questions:
1. Is `nvidia-smi` giving a reasonable representation of how busy the GPU is? Does it, for example, include the time to load and unload textures? If not, is there a better metric I could be using?
2. Assuming that it is, and the GPU really is sitting back doing nothing, are there any well-understood architectures for increasing throughput? I've considered tiling multiple images into a large texture, but this feels like it will drive up CPU usage rather than GPU usage.
3. Is there some way I could be loading the next image into GPU texture memory while the GPU is processing the previous image? (A rough sketch of what I have in mind is below.)
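For question 3, the sort of overlap I have in mind would, I think, use pixel buffer objects. The snippet below is only a sketch of the idea, not something I have working: fill one PBO with image N+1 while the GPU is still sampling the texture that was filled from the other PBO.

```cpp
// Sketch of double-buffered uploads with pixel buffer objects (PBOs).
// With a GL_PIXEL_UNPACK_BUFFER bound, glTexSubImage2D queues an async DMA
// from the PBO instead of copying from client memory, so the transfer of
// image N can overlap the shader work on image N-1.  The same idea applies
// to readback with GL_PIXEL_PACK_BUFFER.
#include <GL/glew.h>
#include <cstdint>
#include <cstring>
#include <vector>

GLuint pbo[2];

void createUploadPBOs(size_t maxBytes) {
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; ++i) {
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, maxBytes, nullptr, GL_STREAM_DRAW);
    }
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

// Called once per image, alternating `which` between 0 and 1.
void uploadViaPBO(int which, GLuint tex, const std::vector<uint8_t>& rgba,
                  int w, int h) {
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[which]);

    // Orphan the buffer, then copy the decoded pixels into driver-owned memory.
    glBufferData(GL_PIXEL_UNPACK_BUFFER, rgba.size(), nullptr, GL_STREAM_DRAW);
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    std::memcpy(dst, rgba.data(), rgba.size());
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // With an unpack PBO bound, the last argument is an offset into the PBO,
    // and this call returns without waiting for the transfer to complete.
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```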
Sampling `nvidia-smi` is a really poor way of figuring out utilization. Use Nvidia Visual Profiler (I find this easiest to work with) or Nvidia Nsight to get a true picture of what your performance and bottlenecks are.
It's hard to say how to improve performance without seeing your code and without you having a better understanding of what the bottleneck is.
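If a full profiler run is not convenient, one lightweight option is OpenGL timer queries around each stage. This is only a rough sketch (it assumes GL 3.3+ and that the stage being timed is whatever you issue between the begin/end calls; the function-pointer wrapper is just for illustration), but it gives you real per-stage GPU time independent of what `nvidia-smi` happens to sample:

```cpp
// Rough sketch of timing one stage with a GL_TIME_ELAPSED query (GL 3.3+).
// Measures actual GPU execution time of the commands issued between
// glBeginQuery and glEndQuery.
#include <GL/glew.h>

double timeStageMs(void (*stage)())
{
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    stage();                     // e.g. the upload, the draw, or the readback
    glEndQuery(GL_TIME_ELAPSED);

    // Blocks until the result is ready -- fine for profiling, not production.
    GLuint64 nanoseconds = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &nanoseconds);
    glDeleteQueries(1, &query);
    return nanoseconds / 1.0e6;
}
```

If the sum of those per-stage GPU times is much smaller than the wall-clock time between images, the GPU really is idle and the bottleneck is on the CPU or transfer side.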