
How to optimize 2 identical kernels with 50% occupancy that could run concurrently in CUDA?


I have 2 identical kernels in CUDA that report 50% theoretical occupancy and could be run concurrently. However, calling them in different streams shows sequential execution.

Each kernel call has the grid and block dimensions as follows:

Grid(3, 568, 620)
Block(256, 1, 1)
With 50 registers per thread.

This results in too many threads per SM and too many registers per block.

Should I focus my next optimization efforts on reducing the number of registers used by the kernel?

Or does it make sense to split the grid into many smaller grids, potentially allowing the 2 kernels to be issued and run concurrently? Would the number of registers per block still pose an issue here?

Note - deviceQuery reports:

MAX_REGISTERS_PER_BLOCK 65K
MAX_THREADS_PER_MULTIPROCESSOR 1024
NUMBER_OF_MULTIPROCESSORS 68

Solution

  • I have 2 identical kernels in CUDA that report 50% theoretical occupancy ...

    OK

    ... and could be run concurrently

    That isn't what occupancy implies and is not correct.

    50% occupancy doesn't mean you have 50% unused resources which a different kernel could use concurrently. It means your code exhausted a resource when running 50% of the maximum theoretical number of concurrent warps. If you have exhausted a resource, you can't run any more warps, be they from that kernel or any other.
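    To make that concrete, here is a minimal sketch in plain C of how theoretical occupancy falls out of the most restrictive per-SM resource. The SM limits and the 128-registers-per-thread kernel are hypothetical illustrative numbers (not the asker's GPU or kernel), and register allocation granularity is ignored:

    ```c
    #include <stdio.h>

    /* Hypothetical SM limits for illustration only. */
    #define REGS_PER_SM        65536
    #define MAX_THREADS_PER_SM 1024
    #define MAX_BLOCKS_PER_SM  16

    static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }

    int main(void)
    {
        int threads_per_block = 256;
        int regs_per_thread   = 128;  /* hypothetical register-heavy kernel */

        /* Each resource caps the number of resident blocks per SM;
           the smallest cap wins. */
        int by_threads = MAX_THREADS_PER_SM / threads_per_block;              /* 4  */
        int by_regs    = REGS_PER_SM / (regs_per_thread * threads_per_block); /* 2  */
        int by_blocks  = MAX_BLOCKS_PER_SM;                                   /* 16 */

        int resident_blocks = min3(by_threads, by_regs, by_blocks);           /* 2  */
        double occupancy = (double)(resident_blocks * threads_per_block)
                         / MAX_THREADS_PER_SM;

        printf("resident blocks/SM: %d, occupancy: %.0f%%\n",
               resident_blocks, occupancy * 100.0);
        return 0;
    }
    ```

    In this hypothetical case the register file is completely consumed at 50% occupancy (2 blocks × 256 threads × 128 registers = 65,536 registers), so a second kernel would find no registers free. That is the sense in which 50% occupancy means a resource is exhausted, not half idle.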

    However, calling them in different streams shows sequential execution.

    That is exactly what should be expected, for the reasons given above.

    Each kernel call has the grid and block dimensions as follows:

    Grid(3, 568, 620)
    Block(256, 1, 1)
    With 50 registers per thread.
    

    You have a kernel which launches 1,056,480 blocks (3 × 568 × 620). That is several orders of magnitude more than even the largest GPUs can run concurrently, meaning that the scope for concurrent kernel execution with such an enormous grid is basically zero.
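    A back-of-envelope calculation in plain C shows the scale. The figure of 2 resident 256-thread blocks per SM is an assumption consistent with the reported 50% occupancy and the 1024-thread, 68-SM device from deviceQuery:

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* Grid(3, 568, 620) from the question */
        long long total_blocks = 3LL * 568 * 620;            /* 1,056,480 */

        /* Assumed: 2 resident 256-thread blocks per SM, i.e. 512 of the
           1024 thread slots, matching the reported 50% occupancy. */
        int sms = 68, blocks_per_sm = 2;
        long long resident = (long long)sms * blocks_per_sm; /* 136 */

        /* Number of full "waves" of blocks the GPU must run through. */
        long long waves = (total_blocks + resident - 1) / resident;

        printf("total blocks: %lld, resident at once: %lld, waves: %lld\n",
               total_blocks, resident, waves);
        return 0;
    }
    ```

    Under these assumptions the GPU works through the grid in roughly 7,769 waves of 136 blocks; a second kernel's blocks could only start once that queue drains, which is why the two streams appear to execute sequentially.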

    This results in too many threads per SM and too many registers per block.

    Register pressure is probably what is limiting the occupancy.

    Should I focus my next optimization efforts on reducing the number of registers used by the kernel?

    Given that the goal of concurrent kernel execution is impossible, I would think the objective should be to make this kernel run as fast as possible. How you do that is code specific. In some cases register optimization can increase occupancy and performance, but sometimes all that happens is you get spills to local memory which hurts performance.
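    If you do want to experiment with register reduction, the usual knobs are `__launch_bounds__` on the kernel or the `-maxrregcount` compiler flag. A sketch, where the kernel name and signature are placeholders rather than the asker's code:

    ```cuda
    // __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) asks
    // nvcc to cap register usage so that at least 4 blocks of 256 threads can
    // be resident per SM. If the cap is too tight, the compiler spills to
    // local memory, which can cost more than the occupancy gain.
    __global__ void __launch_bounds__(256, 4)
    myKernel(float *out, const float *in)
    {
        // ... kernel body ...
    }
    ```

    The coarser alternative is compiling with `nvcc -maxrregcount=N` (for some chosen N), which applies to every kernel in the file. Either way, profile before and after: as noted above, spills to local memory can easily outweigh any occupancy win.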

    Or does it make sense to split the grid into many smaller grids, potentially allowing the 2 kernels to be issued and run concurrently?

    When you say "many", you are implying thousands of grids, and that much launch and scheduling overhead would swamp any benefit I could imagine, even if you could reach the point where concurrent kernel execution was possible.