I am hitting a kernel launch failure because too many resources are requested. I understand the error and know I can reduce my block size to avoid it, but I am trying to find a way around that.
I am working with an Nvidia Tesla K40c GPU, using PyCUDA to solve a system of PDEs. My goal is to do some local computation in each thread and then write the result into a shared memory array. I am fairly new to GPU computing, but what I know of the problem at hand is written below. The issue has to do with the commented-out line of code in the snippet below. I know that shared memory is ideal for inter-thread communication within a block, and my shared memory works correctly until I try to write into it from a local variable, which I am assuming is stored in registers. I am assuming this because I read that local arrays below a certain size (16 floats, if I remember correctly) MAY be stored in registers; mine are of size 4. That is the goal anyway, to avoid storing in global memory.
__device__
void step(float *shared_state, int idx)
{
    float dfdxy[NVC] = {0};  // local array; small enough that it may live in registers
    get_dfdx(dfdxy, shared_state, idx);
    get_dfdy(dfdxy, shared_state, idx);
    __syncthreads();
    //shared_state[idx+0*SGIDS] += dfdxy[0];  // uncommenting this triggers the launch failure
}
Here is the trace. As I mentioned, I am familiar with the error.
Traceback (most recent call last):
File "./src/sweep/sweep.py", line 325, in <module>
sweep(arr0,targs,order,block_size,euler.step,0)
File "./src/sweep/sweep.py", line 109, in sweep
gpu_speed(arr, source_mod, cpu_fcn, block_size,ops,num_tries=20)
File "./src/sweep/sweep.py", line 175, in gpu_speed
gpu_fcn(arr_gpu,grid=grid_size, block=block_size,shared=shared_size)
File "/home/walkanth/.conda/envs/pysweep/lib/python3.6/site-packages/pycuda/driver.py", line 402, in function_call
func._launch_kernel(grid, block, arg_buf, shared, None)
pycuda._driver.LaunchError: cuLaunchKernel failed: too many resources requested for launch
The issue is specifically this: when I run the code with the line commented out, the compiler reports that I am using 32 registers. That is good; everything works because I am below the limit of 63.
However, when I uncomment the line, the register count jumps to 70, and I suspect this is why the kernel launch fails.
So, a couple of questions.
First, can anyone explain why this is happening? I have been searching for some time and have come up short.
Second, if there is no way around this, does anyone know of some tips to reduce my register usage aside from decreasing the block size? I have seen some older threads on the Nvidia developer forums about this, but they seem pretty dated.
Edit:
So thanks to Michael on this post, I found out that the GPU I have actually allows 255 registers per thread. So per-thread registers are not the problem, but that leaves me unsure where the issue is coming from.
It may also be worth noting that I am not using any specific compiler options. I tried -ptxas at one point, but it did not change much.
I don't want to reduce the block size because the number of calculations I can make before needing outside information depends on the block size's minimum dimension (x or y). The larger the block size, the more calculations are possible.
Edit: So, to my understanding, I am still exceeding the total number of registers per SM, which is the cause of the issue. I need to reduce either the registers used or the block size.
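Under that reading, the failure follows from simple arithmetic. A minimal sketch, assuming a 32x32 block (hypothetical; the post does not state the actual block size) and the compute capability 3.5 limit of 65,536 32-bit registers per block:

```python
# Register-budget check for a single block on compute capability 3.5.
# Assumption: 32x32 block (hypothetical; the real block size isn't given).
REGS_PER_BLOCK_LIMIT = 65536   # CC 3.5: 64K 32-bit registers per block (and per SM)

def block_fits(regs_per_thread, block_dim):
    """Return (registers needed by one block, whether it fits the limit)."""
    threads = block_dim[0] * block_dim[1]
    needed = regs_per_thread * threads
    return needed, needed <= REGS_PER_BLOCK_LIMIT

print(block_fits(32, (32, 32)))  # (32768, True)  -> 32 regs/thread launches fine
print(block_fits(70, (32, 32)))  # (71680, False) -> "too many resources requested"
```

If these assumed numbers match your launch, the jump from 32 to 70 registers per thread is exactly what pushes the block over the budget.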
The compiler will try to optimize register usage automatically; if you've written code that ultimately doesn't store information anywhere outside the thread, the instructions producing it simply won't be generated. That is probably why you see a large jump in register count when you uncomment the line that writes to shared memory: the writes to dfdxy are no longer dead code.
However, according to https://developer.nvidia.com/cuda-gpus, the K40c is compute capability 3.5, and according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities, devices of compute capability 3.5 allow up to 255 registers per thread, not 63. Thus, if you are only using 70 registers per thread, the per-thread limit probably isn't the issue. This is corroborated if reducing the block size makes the error go away: a smaller block reduces the number of threads in the block but should not change how many registers each thread uses, so it would not fix the problem if you were actually hitting the per-thread register limit.
Without further knowledge of your compiler options, the rest of your kernel, and how you're launching it, we can't easily ascertain what the resource problem is. There are also limits on the number of registers per block and the number of registers per multiprocessor; if reducing the block size fixes the problem then it's probable that you're exceeding these thresholds... and need to reduce the block size. It's unclear why you don't want to reduce your block size, but it seems like you're just running up against a hardware limitation.
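If the per-block/per-SM register budget really is the constraint, one alternative to shrinking the block is asking the compiler to cap register usage (nvcc's --maxrregcount flag, or __launch_bounds__ in the kernel), at the cost of possible spills to local memory. A hedged sketch of the arithmetic, assuming a 1024-thread block (hypothetical) and the CC 3.5 limit of 65,536 registers per block:

```python
# Largest per-thread register count that still lets a block launch under
# the CC 3.5 limit of 64K registers per block.
# (The 1024-thread block size is an assumption; the post doesn't state it.)
REGS_PER_BLOCK_LIMIT = 65536

def max_regs_per_thread(threads_per_block):
    """Integer register cap that keeps one block within the budget."""
    return REGS_PER_BLOCK_LIMIT // threads_per_block

print(max_regs_per_thread(1024))  # 64 -> e.g. compile with --maxrregcount=64
```

With 70 registers per thread a 1024-thread block needs more than 64K registers, but capping at 64 (or the 63 the question already mentions) would fit; the trade-off is that squeezed registers may spill to local memory and slow the kernel, so it's worth profiling both options.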