parallel-processing cuda emgucv opencv3.1 managed-cuda

Advantage of using a CUDA Stream

I am trying to understand where a Stream might help me with processing multiple Regions of Interest on a video frame. If using NPP functions that support a stream, is this a case where one would launch as many streams as there are ROIs? Possibly even creating a CPU thread for each Stream? Or is the benefit in using one stream to process all the ROIs and possibly using this single stream from multiple threads in the CPU?

Solution

In CUDA, usage of streams generally helps to better utilize GPU in two ways. Firstly, memory copies between host and device can be overlapped by kernel execution if copying and execution occur in different streams. Secondly, individual kernels running in different streams can overlap if there are enough resources on the GPU.

Further, whether creating a thread for each ROI would help depends on comparison of GPU vs CPU (if any) utilization. If there is a lot of processing on CPU and CPU holds off GPU computation, creating more threads helps.

There are further details (see the documentation for actual version of CUDA) which constrain overlapping of operations in the streams. A memory copy overlaps with a kernel execution only if memory source or destination in RAM is page-locked. Or, synchronization between streams occurs when host thread issues command(s) in the default stream. (Since CUDA 7 each thread has its own default stream, so processing ROIs in different threads would help again.)

Hence, satisfying certain conditions, it should improve performance of your algorithm if the processing of ROIs occurs in different streams up to certain limit (depending on resource consumption of the kernels, ratio of memory copies and computation, etc...)