Tags: cuda, 2d, nested-loops, flatten

2D kernel calling and launch parameters for non-square matrix


I am attempting to port the following (simplified) nested loop as a CUDA 2D kernel. The sizes of NgS and NgO will increase with larger data sets; for now I just want to get this kernel to output the correct results for all values:

// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )

int NgS  = 1859;
int NgO  = 900;

// 1D flattened matrices have been initialized as:
Radio_cpu  = new double [NgS*NgO];
Result_cpu = new double [NgS*NgO];
// ignoring the part where they are filled w/ data

for (int m=0; m<NgO; m++) {
    for (int n=0; n<NgS; n++) {
        Result_cpu[idx(n,m,NgO)] = k0*Radio_cpu[idx(n,m,NgO)];
    }
}
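
(For concreteness: with this macro, element (n, m) lives at linear offset m + n*NgO, so with NgO = 900, idx(0,1,900) = 1 while idx(1,0,900) = 900; consecutive column indices are adjacent in memory and consecutive row indices are NgO elements apart.)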

The examples I have come across usually deal with square matrices, and I have been unable to get the GPU output to match the CPU version across all array indices. Here is the host code calling the kernel:

dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;

// Result_gpu and Radio_gpu are allocated versions of the CPU variables on GPU
trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, Radio_gpu, Result_gpu);
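
(With these sizes, dimGrid.x = (900 + 15)/16 = 57 and dimGrid.y = (1859 + 15)/16 = 117, so the launch covers 912 × 1872 threads for a 900 × 1859 element matrix; the extra threads fall outside the data and have to be guarded against in the kernel.)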

Here is the kernel:

__global__ void trans(int NgO, int NgS,
                      double k0, double * Radio, double * Result) {

    int n = blockIdx.x * blockDim.x + threadIdx.x;
    int m = blockIdx.y * blockDim.y + threadIdx.y;

    if(n > NgS || m > NgO) return;

    // map the two 2D indices to a single linear, 1D index
    int grid_width = gridDim.x * blockDim.x;
    int idxxx = m + (n * grid_width);

    Result[idxxx] = k0 * Radio[idxxx];
}

With the current code in place, I compared Result_cpu with Result_gpu after copying the latter back from the device. When I cycle through the values I get:

    // matches from NgS = 0...913
    Result_gpu[NgS = 913][NgO = 0]: -56887.2
    Result_cpu[NgS = 913][NgO = 0]: -56887.2

    // mismatches from NgS = 914...1858
    Result_gpu[NgS = 914][NgO = 0]: -12.2352
    Result_cpu[NgS = 914][NgO = 0]: 79448.6

This pattern is the same regardless of the value of NgO. I have spent a few hours comparing against various examples and trying out changes; this scheme at least runs, minus the obvious issue at hand, whereas the alternatives caused kernel invocation errors or left the GPU array uninitialized for all values. Since I clearly cannot see the mistake, I'd really appreciate it if someone could point me in the right direction towards a fix. I'm pretty sure it's right under my nose and I can't see it.
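
The comparison loop is essentially this (a sketch, not the verbatim loop):

// host-side check after copying Result_gpu back from the device
for (int n = 0; n < NgS; n++) {
    for (int m = 0; m < NgO; m++) {
        if (Result_cpu[idx(n,m,NgO)] != Result_gpu[idx(n,m,NgO)])
            printf("mismatch at [NgS = %d][NgO = %d]\n", n, m);
    }
}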

In case it matters, I'm testing this code on a Kepler card, compiling with MSVC 2010, CUDA 4.2 and the 304.79 driver. I have compiled with both arch=compute_20,code=sm_20 and arch=compute_30,code=compute_30 flags, with no difference.
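
(For reference, the corresponding nvcc invocation looks roughly like this, with trans.cu standing in for the actual source file:)

nvcc -gencode arch=compute_20,code=sm_20 -o trans trans.cu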


Solution

  • @vaca_loca: I tested the following kernel (it also works for me with non-square block dimensions):

    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <cmath>
    #include <cuda_runtime.h>

    __global__ void trans(int NgO, int NgS,
                          double k0, double * Radio, double * Result) {

        int n = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int m = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        // >= (not >) so the rounded-up grid never touches out-of-range elements
        if(n >= NgO || m >= NgS) return;
        // row-major offset, with the matrix width NgO as the pitch
        int ofs = m * NgO + n;
        Result[ofs] = k0 * Radio[ofs];
    }
    
    void test() {

        int NgS = 1859, NgO = 900;
        int data_sz = NgS * NgO, bytes = data_sz * sizeof(double);
        cudaSetDevice(0);

        // one host allocation holds the input, the CPU reference and the GPU result
        double *Radio_cpu  = new double [data_sz*3],
               *Result_cpu = Radio_cpu + data_sz,
               *Result_gpu = Result_cpu + data_sz;
        double k0 = -1.7961233;

        srand48(time(NULL));   // POSIX; substitute rand() on MSVC
        int n, m;
        for(m = 0; m < NgO; m++) {
            for(n = 0; n < NgS; n++) {
                Radio_cpu[m + n*NgO]  = lrand48() % 234234;
                Result_cpu[m + n*NgO] = k0 * Radio_cpu[m + n*NgO];
            }
        }

        // one device allocation holds both device arrays
        double *g_Radio, *g_Result;
        cudaMalloc((void **)&g_Radio, bytes * 2);
        g_Result = g_Radio + data_sz;
        cudaMemcpy(g_Radio, Radio_cpu, bytes, cudaMemcpyHostToDevice);

        dim3 dimBlock(16, 16);
        dim3 dimGrid;
        dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
        dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;

        trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, g_Radio, g_Result);

        cudaMemcpy(Result_gpu, g_Result, bytes, cudaMemcpyDeviceToHost);

        // report any element where CPU and GPU disagree
        for(m = 0; m < NgO; m++) {
            for(n = 0; n < NgS; n++) {
                double c1 = Result_cpu[m + n*NgO],
                       c2 = Result_gpu[m + n*NgO];
                if(std::abs(c1 - c2) > 1e-4)
                    printf("(%d;%d): %.7f %.7f\n", n, m, c1, c2);
            }
        }
        cudaFree(g_Radio);
        delete []Radio_cpu;
    }
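
    One caveat: the test harness above does not check CUDA API return codes. A minimal error-checking wrapper (my addition, not part of the original answer) might look like:

    // wrap every CUDA runtime call; abort with a readable message on failure
    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err__ = (call);                                 \
            if (err__ != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                        cudaGetErrorString(err__), __FILE__, __LINE__); \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    // usage: CUDA_CHECK(cudaMemcpy(g_Radio, Radio_cpu, bytes, cudaMemcpyHostToDevice));
    // after a kernel launch: CUDA_CHECK(cudaGetLastError());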
    

    Though, in my opinion, accessing global memory this way might not be very cache-friendly, since the access stride between consecutive rows is quite large. You might consider using 2D textures instead if it is critical for your algorithm to access data with 2D locality.
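
    As a rough illustration of that alternative (my sketch, not part of the original answer; it uses pitched linear memory rather than textures, since texturing double data on this toolkit requires an int2 workaround), the allocation and kernel could look like:

    // Hypothetical sketch: pitched 2D allocation keeps every row aligned,
    // so row-wise accesses stay coalesced regardless of NgO.
    __global__ void trans_pitched(int NgO, int NgS, double k0,
                                  const double *Radio, double *Result,
                                  size_t pitch) {          // pitch = row size in bytes
        int n = blockIdx.x * blockDim.x + threadIdx.x;     // column
        int m = blockIdx.y * blockDim.y + threadIdx.y;     // row
        if (n >= NgO || m >= NgS) return;
        // advance by the pitch (in bytes), not by NgO * sizeof(double)
        const double *src = (const double *)((const char *)Radio + m * pitch);
        double *dst       = (double *)((char *)Result + m * pitch);
        dst[n] = k0 * src[n];
    }

    // host side: let the runtime pick the pitch and copy row by row
    size_t pitch;
    double *g_Radio, *g_Result;
    cudaMallocPitch((void **)&g_Radio,  &pitch, NgO * sizeof(double), NgS);
    cudaMallocPitch((void **)&g_Result, &pitch, NgO * sizeof(double), NgS);
    cudaMemcpy2D(g_Radio, pitch, Radio_cpu, NgO * sizeof(double),
                 NgO * sizeof(double), NgS, cudaMemcpyHostToDevice);
    trans_pitched<<<dimGrid,dimBlock>>>(NgO, NgS, k0, g_Radio, g_Result, pitch);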