Tags: gpu, nvidia, decoding, nvml

Load Balancing Challenges with NVIDIA GPUs in CCTV Video Decoding


We have a CCTV system where we use NVIDIA GPUs for video decoding. Our current requirement is to monitor GPU decoding and memory usage, and if the usage reaches 80%, we need to automatically switch new streams to the next available GPU.

We have implemented GPU monitoring using NVML, but when multiple streams are initiated simultaneously, they all tend to go to the same GPU. We are looking for an effective strategy or best practices to distribute the streams evenly across multiple GPUs when they are opened concurrently.
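
For reference, a minimal sketch of the kind of NVML-based check described above, using the pynvml bindings (illustrative code, not the original implementation; the 80% threshold and the selection logic are assumptions):

```python
import pynvml

pynvml.nvmlInit()

def pick_gpu(threshold: int = 80):
    """Return the index of the first GPU below the utilization threshold, or None."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        # NVDEC utilization in percent, plus the sampling period it was averaged over.
        dec_util, _period_us = pynvml.nvmlDeviceGetDecoderUtilization(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        mem_util = 100 * mem.used / mem.total
        if dec_util < threshold and mem_util < threshold:
            return i
    return None  # every GPU is above the threshold
```

A check like this illustrates the race described: streams opened at the same moment all observe the same, still-low reading and therefore land on the same GPU.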

Any advice or suggestions on how to achieve this load balancing effectively would be greatly appreciated.

Thank you!


Solution

Don't monitor - estimate the load. If you try to measure, you will find that the reported load fluctuates heavily due to various external factors (e.g. stalled uploads delaying the decoder, or accidentally sampling utilization right between frames), and you will almost certainly under- or overshoot the intended load level.

The load is almost proportional to frame rate and video resolution, with the resolution rounded up to a multiple of 128 pixels in both dimensions. The rounding is due to an undocumented implementation detail of the video decoder: it processes video in tiles of that granularity.
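
As a rough sketch of that estimate (the 128-pixel rounding follows the tiling described above; the function name and units are illustrative):

```python
def decode_cost(width: int, height: int, fps: float) -> float:
    """Estimated decode load of one stream, in padded pixels per second."""
    tile = 128
    padded_w = -(-width // tile) * tile  # round up to a multiple of 128
    padded_h = -(-height // tile) * tile
    return padded_w * padded_h * fps

# A 1080p30 stream is costed as 1920 x 1152 x 30 = 66,355,200 pixels/s.
print(decode_cost(1920, 1080, 30))
```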

Bitrate and specific encoding details (used or unused codec features) have little to no impact at all. There is only a correction factor per codec family (e.g. H.264 vs. H.265 vs. VC-1 vs. VP9), and since they all compete for resources from the same pool, the per-stream costs can simply be summed up.
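
A sketch of how such per-family correction factors could be folded into the estimate above (the weights are placeholders to calibrate yourself, not measured values):

```python
# Placeholder correction factors per codec family - calibrate against your own measurements.
CODEC_WEIGHT = {"h264": 1.0, "hevc": 1.0, "vc1": 1.0, "vp9": 1.0}

def weighted_cost(width: int, height: int, fps: float, codec: str) -> float:
    """Codec-adjusted decode cost; costs on one GPU can simply be summed."""
    return CODEC_WEIGHT[codec] * decode_cost(width, height, fps)
```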

The same amount of resources is available on all models of a generation; it does not scale with clock speeds. The only exception is chips that have multiple video decoder units, in which case you can simply multiply the available decoding budget. There has actually been very little difference between GPUs since the introduction of the Pascal family (3rd-gen NVDEC) either; only the feature set has been extended, not the per-unit performance.

Check https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new, specifically the rows "Total # of NVDEC" and "NVDEC Generation". Those are the only factors determining the available throughput.

You will have to make a reference measurement for the GPU families you use, to determine a peak throughput value in "pixels per second" for the video codec relevant to you - I can no longer recall the exact numbers. Use a single 4K video stream for the reference measurement, as it scales slightly worse than a bunch of concurrent lower-resolution streams.

You can generally run the video decoder unit at up to 95% of the peak throughput rate measured this way without losing real-time decoding capability.
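
Combining the NVDEC count from the support matrix, the self-measured reference throughput and the 95% margin gives a per-GPU budget, sketched below (the reference constant is a placeholder, not a benchmark result):

```python
# Peak throughput of a single NVDEC unit in padded pixels/s for your codec,
# measured with a single 4K stream as described above. Placeholder value.
REFERENCE_PIXELS_PER_SEC = 1.0e9

def gpu_decode_budget(nvdec_count: int, margin: float = 0.95) -> float:
    """Usable decode budget of a GPU with the given number of NVDEC units."""
    return nvdec_count * REFERENCE_PIXELS_PER_SEC * margin
```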

Video decoder throughput is independent of compute or graphics load on the shader units.

Don't try to apply this logic to any of the models with a 64-bit GDDR4 or slower memory interface - they don't have enough memory bandwidth to achieve full throughput on the decoder unit. Likewise, you generally want to avoid saturating the memory bandwidth with shader work; both will stall the video decoder unit.

    We are looking for an effective strategy or best practices to distribute the streams evenly across multiple GPUs when they are opened concurrently.

There is really no benefit to distributing eagerly. If you predict the utilization correctly, so that it is guaranteed to stay below 100%, you will achieve the same user experience at lower power consumption.
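
Putting the pieces together, a sketch of such prediction-based placement (first-fit packing instead of eager spreading; it reuses the helpers from the earlier sketches):

```python
class DecodeScheduler:
    """Places streams by predicted decoder load instead of measured utilization."""

    def __init__(self, nvdec_counts):
        # One entry per GPU: its NVDEC unit count from the support matrix.
        self.budgets = [gpu_decode_budget(n) for n in nvdec_counts]
        self.loads = [0.0] * len(nvdec_counts)

    def place(self, width, height, fps, codec="h264"):
        """Return the GPU index for a new stream, or None if every GPU is full."""
        cost = weighted_cost(width, height, fps, codec)
        for gpu, (load, budget) in enumerate(zip(self.loads, self.budgets)):
            if load + cost <= budget:
                self.loads[gpu] += cost  # reserve the budget immediately
                return gpu
        return None

    def release(self, gpu, width, height, fps, codec="h264"):
        """Return the budget when a stream is closed."""
        self.loads[gpu] -= weighted_cost(width, height, fps, codec)
```

Because each placement reserves its predicted cost immediately, streams opened concurrently no longer race onto the same GPU, and the next GPU is only used once the previous one's budget is actually exhausted.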