Tags: c++, parallel-processing, cuda, gpu, npp

Does NPP support overlapping streams?


I'm trying to perform multiple async 2D convolutions on a single image with multiple filters using NVIDIA's NPP function nppiFilterBorder_32f_C1R_Ctx. However, even after creating multiple streams and assigning one to each call, no overlap occurs; NVIDIA's nvvp profiler confirms this:

[nvvp screenshot: non-overlapping streams]

That said, I'm unsure whether NPP supports overlapping operations across stream contexts at all.

Below is a simplification of my code, only showing the async method calls and related variables:

std::vector<NppStreamContext> streams(n_filters);

for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
  cudaStreamCreateWithFlags(&(streams[stream_idx].hStream), cudaStreamNonBlocking);
  streams[stream_idx].nStreamFlags = cudaStreamNonBlocking;
  // fill up NppStreamContext remaining fields 
  // malloc image and filter pointers
}

// Enqueue upload, convolution, and download on each filter's own stream
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
  cudaMemcpyAsync(..., streams[stream_idx].hStream);
  nppiFilterBorder_32f_C1R_Ctx(..., streams[stream_idx]);
  cudaMemcpy2DAsync(..., streams[stream_idx].hStream);
}

// Wait for all enqueued work to finish, then release the streams
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
  cudaStreamSynchronize(streams[stream_idx].hStream);
  cudaStreamDestroy(streams[stream_idx].hStream);
}

Note: All the device pointers to the output images and input filters are stored in std::vectors, which I index by the current stream index (e.g., float *ptr_filter_d = filters[stream_idx]).


Solution

  • To summarize and add to the comments: The profile does show small overlaps, so the answer to the title question is clearly yes.

    The reason the overlap is so small is that each NPP kernel already needs all the resources of the used GPU for most of its runtime. At the end of each kernel one can probably see the tail effect (i.e., the grid's number of blocks is not a multiple of the number of blocks that can be resident on the SMs at any moment in time; with hypothetical numbers, a grid of 1000 blocks on a GPU that can hold 96 resident blocks ends with a final wave of only 1000 mod 96 = 40 blocks), so blocks from the next kernel get scheduled during that tail and there is some overlap.

    It can sometimes be useful (i.e., an optimization) to force overlap between a big kernel that was started first and uses the full device and a later, small kernel that needs only a few resources. In that case one can use stream priorities via cudaStreamCreateWithPriority to hint the scheduler to schedule blocks from the second kernel before blocks from the first. An example of this can be found in this multi-GPU example (permalink); a minimal sketch is also given at the end of this answer.

    In this case, however, since the kernels are all the same size and there is no reason to prioritize any of them over the others, forcing overlap like this would not decrease the total runtime, because the compute resources are limited. In the profiler view the kernels might then show more overlap, but each one would also take longer. That is why the scheduler does not overlap the kernels even though you allow it to by using multiple streams (see asynchronous vs. parallel).

    To still increase performance, one could write a custom CUDA kernel that applies all the filters in one kernel launch. The main reason this could be better than using NPP in this case is that all the NPP kernels take the same input image. A single kernel could therefore significantly decrease the number of accesses to global memory by reading each tile of the input image only once (into shared memory, although L1 caching might suffice), then applying all the filters sequentially or in parallel (by splitting the thread block into smaller units) and writing out the results. A rough sketch of such a fused kernel is given below, after the priority example.
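
    A minimal sketch of the priority approach mentioned above, assuming two hypothetical kernels big_kernel and small_kernel (error checking omitted):

    #include <cuda_runtime.h>

    // Query the priority range supported by the device;
    // numerically lower values mean higher priority.
    int least_prio, greatest_prio;
    cudaDeviceGetStreamPriorityRange(&least_prio, &greatest_prio);

    cudaStream_t low_prio_stream, high_prio_stream;
    cudaStreamCreateWithPriority(&low_prio_stream, cudaStreamNonBlocking, least_prio);
    cudaStreamCreateWithPriority(&high_prio_stream, cudaStreamNonBlocking, greatest_prio);

    // The big kernel is launched first and fills the device; the scheduler
    // will prefer blocks of the small kernel once resources free up.
    // big_kernel<<<big_grid, block, 0, low_prio_stream>>>(...);
    // small_kernel<<<small_grid, block, 0, high_prio_stream>>>(...);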
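
    And here is a rough sketch of the fused-kernel idea (not from the original answer). The 5x5 filter size, the 16x16 tile, replicate-border handling, and element-based (not byte-based) pitches are all assumptions; dst is assumed to be a device-side array of per-filter output pointers:

    #define TILE 16
    #define RADIUS 2                 // assumes 5x5 filters
    #define KSIZE (2 * RADIUS + 1)
    #define SH (TILE + 2 * RADIUS)   // shared tile including halo

    __global__ void fusedFilters(const float* __restrict__ src, int src_pitch,
                                 float* const* dst, int dst_pitch,
                                 const float* __restrict__ filters, // n_filters * KSIZE * KSIZE coefficients
                                 int n_filters, int width, int height)
    {
        __shared__ float tile[SH][SH];

        // Cooperatively load the input tile plus halo from global memory
        // exactly once, clamping coordinates at the borders (replicate).
        for (int dy = threadIdx.y; dy < SH; dy += TILE)
            for (int dx = threadIdx.x; dx < SH; dx += TILE) {
                int gx = (int)(blockIdx.x * TILE) + dx - RADIUS;
                int gy = (int)(blockIdx.y * TILE) + dy - RADIUS;
                int sx = min(max(gx, 0), width - 1);
                int sy = min(max(gy, 0), height - 1);
                tile[dy][dx] = src[sy * src_pitch + sx];
            }
        __syncthreads();

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x >= width || y >= height) return;

        // Apply every filter to the tile that was read only once.
        for (int f = 0; f < n_filters; ++f) {
            float acc = 0.0f;
            for (int ky = 0; ky < KSIZE; ++ky)
                for (int kx = 0; kx < KSIZE; ++kx)
                    acc += tile[threadIdx.y + ky][threadIdx.x + kx]
                         * filters[(f * KSIZE + ky) * KSIZE + kx];
            dst[f][y * dst_pitch + x] = acc;
        }
    }

    // Launch with one thread per output pixel, e.g.:
    // dim3 block(TILE, TILE);
    // dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    // fusedFilters<<<grid, block>>>(src_d, src_pitch, dst_d, dst_pitch, filters_d, n_filters, width, height);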