CUDA kernel fails to launch after making simple code changes inside the kernel

I have a templated CUDA kernel for calculating and setting values at the interface between 2 computational meshes. The values are calculated using 3 separate contributions, obtained from class member functions with class instances passed to the kernel. If I obtain any one of these contributions alone to set in the output the kernel works. As soon as I add 2 (or all) of these contributions to set in the output the kernel simply does not launch at all.

I've inserted the full kernel code at the end, but I'll try to exemplify the above first.

First define the first 2 contributions:

//contribution 1
VType value1 = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)) / (b_val_sec + b_val_pri);
//contribution 2
VType value2 = (Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR) / (b_val_sec + b_val_pri);

Now set output:

Case 1 - kernel launches and sets expected values:

V_pri[cell1_idx] = value1;

Case 2 - kernel launches and sets expected values:

V_pri[cell1_idx] = value2;

Case 3 - kernel does not launch:

V_pri[cell1_idx] = value1 + value2;

I am completely stumped as this seems to defy logic and would really like to understand what is happening. Has anyone encountered anything similar, or any idea what could be causing this?

I'm using CUDA 9.2 with Visual Studio 2017 and I've tested the code on GTX 980 Ti (compute 5.2) and GTX 1060 (compute 6.1) with identical results.

Here is the full kernel code:

template <typename VType, typename Class_CMBND>
__global__ void set_cmbnd_values_kernel(
cuVEC_VC<VType>& V_sec, cuVEC_VC<VType>& V_pri,
Class_CMBND& cmbndFuncs_sec, Class_CMBND& cmbndFuncs_pri,
CMBNDInfoCUDA& contact)
{

 int box_idx = blockIdx.x * blockDim.x + threadIdx.x;

 cuINT3 box_sizes = contact.cells_box.size();

 if (box_idx < box_sizes.dim()) {

     int i = (box_idx % box_sizes.x) + contact.cells_box.s.i;
     int j = ((box_idx / box_sizes.x) % box_sizes.y) + contact.cells_box.s.j;
     int k = (box_idx / (box_sizes.x * box_sizes.y)) + contact.cells_box.s.k;

     cuReal hL = contact.hshift_secondary.norm();
     cuReal hR = contact.hshift_primary.norm();
     cuReal hmax = (hL > hR ? hL : hR);

     int cell1_idx = i + j * V_pri.n.x + k * V_pri.n.x*V_pri.n.y;

     if (V_pri.is_empty(cell1_idx) || V_pri.is_not_cmbnd(cell1_idx)) return;

     int cell2_idx = (i + contact.cell_shift.i) + (j + contact.cell_shift.j) * V_pri.n.x + (k + contact.cell_shift.k) * V_pri.n.x*V_pri.n.y;

     cuReal3 relpos_m1 = V_pri.rect.s - V_sec.rect.s + ((cuReal3(i, j, k) + cuReal3(0.5)) & V_pri.h) + (contact.hshift_primary + contact.hshift_secondary) / 2;

     cuReal3 stencil = V_pri.h - cu_mod(contact.hshift_primary) + cu_mod(contact.hshift_secondary);

     VType V_2 = V_pri[cell2_idx];
     VType V_m2 = V_sec.weighted_average(relpos_m1 + contact.hshift_secondary, stencil);

     //a values
     VType a_val_sec = cmbndFuncs_sec.a_func_sec(relpos_m1, contact.hshift_secondary, stencil);
     VType a_val_pri = cmbndFuncs_pri.a_func_pri(cell1_idx, cell2_idx, contact.hshift_secondary);

     //b values adjusted with weights
     cuReal b_val_sec = cmbndFuncs_sec.b_func_sec(relpos_m1, contact.hshift_secondary, stencil) * contact.weights.i;
     cuReal b_val_pri = cmbndFuncs_pri.b_func_pri(cell1_idx, cell2_idx) * contact.weights.j;

     //V'' values at cell positions -1 and 1
     VType Vdiff2_sec = cmbndFuncs_sec.diff2_sec(relpos_m1, stencil);
     VType Vdiff2_pri = cmbndFuncs_pri.diff2_pri(cell1_idx);

     //Formula for V1
     V_pri[cell1_idx] = (V_m2 * 2 * b_val_sec / 3 + V_2 * (b_val_pri + b_val_sec / 3)
        - Vdiff2_sec * b_val_sec * hL * hL - Vdiff2_pri * b_val_pri * hR * hR
        + (a_val_pri - a_val_sec) * hmax) / (b_val_sec + b_val_pri);
 }
}

It's almost as if kernels with too many lines of code in them (in the above kernel there's additional code in the various functions used) fail to launch under certain conditions.

Solution

Right, seems I found the answer to my problem.

Looking at the generated errors I get "Too many resources requested for launch".

I've reduced the number of threads per block from 512 to 256 and the kernel runs fine now.