Tags: python, cuda, gpu, pycuda

PyCUDA Concurrency


Why does the code from the PyCUDA KernelConcurrency example not run faster in 'concurrent' mode? It seems like there should be enough resources on my GPU... what am I missing?

Here is the output from the 'concurrent' version, with line 63 uncommented:

=== Device attributes
Name: GeForce GTX 980
Compute capability: (5, 2)
Concurrent Kernels: True

=== Checking answers
Dataset 0 : passed.
Dataset 1 : passed.

=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 1.68524801731
kernel_end : 1.77305603027
Dataset 1
kernel_begin : 1.7144639492
kernel_end : 1.80246400833

Here is the output from the version with line 63 commented out. This should no longer run concurrently, and should be significantly slower, but it looks nearly the same to me (about 0.08–0.09 s between kernel_begin and kernel_end in both cases):

=== Device attributes
Name: GeForce GTX 980
Compute capability: (5, 2)
Concurrent Kernels: True

=== Checking answers
Dataset 0 : passed.
Dataset 1 : passed.

=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 1.20230400562
kernel_end : 1.28966403008
Dataset 1
kernel_begin : 1.21827197075
kernel_end : 1.30672001839

Is there something I'm missing here? Is there another way to test concurrency?


Solution

  • The only way to truly see what is happening with concurrent kernel execution is to profile the code.
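
    For reference, here is a minimal, self-contained sketch of the same two-stream launch pattern that can be run under a profiler such as nvprof or Nsight Systems to see whether the kernels overlap. (The kernel body, sizes, and names below are illustrative placeholders, not the wiki's code.)

    # Minimal two-stream launch sketch (illustrative, not the wiki code).
    # Profile it with e.g.:  nvprof --print-gpu-trace python sketch.py
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void busy(float *d, int iters)
    {
        float v = d[threadIdx.x];
        for (int i = 0; i < iters; i++)   // spin long enough to observe overlap
            v = v * 0.999f + 0.001f;
        d[threadIdx.x] = v;
    }
    """)
    busy = mod.get_function("busy")

    N, n = 32, 2                              # tiny blocks leave room to overlap
    streams = [cuda.Stream() for _ in range(n)]
    d_data = [cuda.mem_alloc(N * 4) for _ in range(n)]
    for d in d_data:
        cuda.memcpy_htod(d, np.zeros(N, dtype=np.float32))

    for k in range(n):                        # back-to-back launches, one per stream
        busy(d_data[k], np.int32(1 << 20),
             block=(N, 1, 1), grid=(1, 1), stream=streams[k])
    cuda.Context.synchronize()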

    With the inner kernel launch loop as posted on the wiki:

    # Run kernels many times, we will only keep data from last loop iteration.
    for j in range(10):
        for k in range(n):
            event[k]['kernel_begin'].record(stream[k])
            my_kernel(d_data[k], block=(N,1,1), stream=stream[k]) 
        for k in range(n): # Commenting out this line should break concurrency.
            event[k]['kernel_end'].record(stream[k])
    

    the profile trace looks like this:

    [profiler timeline: the kernels from the two streams overlap]

    With the inner kernel launch loop like this (i.e. the kernel end events are not recorded onto the streams in their own loop):

    # Run kernels many times, we will only keep data from last loop iteration.
    for j in range(10):
        for k in range(n):
            event[k]['kernel_begin'].record(stream[k])
            my_kernel(d_data[k], block=(N,1,1), stream=stream[k]) 
    #    for k in range(n): # Commenting out this line should break concurrency.
            event[k]['kernel_end'].record(stream[k])
    

    I get this profile:

    [profiler timeline: the kernels in the two streams still overlap]

    i.e. the kernels in the two execution streams are still overlapping.

    So the execution time doesn't change between the two examples because the comment you are relying on is erroneous: both cases yield kernel execution overlap ("concurrency"). Recording an event on a stream does not serialize work in other streams, so moving the kernel_end record inside the launch loop makes no difference to how the kernels are scheduled.

    I have no interest in understanding why the comment is there, but that is the source of your confusion. You will need to look elsewhere for the source of the poor performance in your code (which apparently doesn't use streams anyway, so this entire question was a straw man).
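
    As an aside, if you want a timing-based sanity check without a profiler, one option (a sketch reusing the names from the snippet above; this approach is my assumption, not the wiki's method) is to time the same launches once funneled through a single stream and once spread across the streams; with real concurrency the multi-stream wall time should be noticeably shorter:

    # Hypothetical timing-based concurrency check, reusing busy/streams/d_data/n/N
    # from the sketch above. Events recorded on the (legacy) default stream
    # synchronize with all other streams, so start/end bracket the total time.
    start, end = cuda.Event(), cuda.Event()

    def time_launches(target_streams):
        start.record()                        # record on the default stream
        for k in range(n):
            busy(d_data[k], np.int32(1 << 20),
                 block=(N, 1, 1), grid=(1, 1), stream=target_streams[k])
        end.record()                          # waits on all streams (legacy default)
        end.synchronize()
        return start.time_till(end)           # elapsed milliseconds

    serial = time_launches([streams[0]] * n)  # everything in one stream
    overlap = time_launches(streams)          # one stream per kernel
    print("one stream: %.2f ms, %d streams: %.2f ms" % (serial, n, overlap))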