Tags: python, cuda, gpu, numba, memory-access

How to achieve memory coalescing when iterating over non-square 2D arrays?


I'm struggling to understand memory coalescing in CUDA. To evaluate the performance difference between coalesced and uncoalesced memory accesses, I have implemented two versions of a kernel that adds two 2D matrices:

from numba import cuda

@cuda.jit
def uncoalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[x][y] = a[x][y] + b[x][y]

@cuda.jit
def coalesced_matrix_add(a, b, out):
    x, y = cuda.grid(2)
    out[y][x] = a[y][x] + b[y][x]

When I test the code above with square matrices, everything works fine: both kernels produce the same result, and the coalesced version is significantly faster:

import numpy as np

nrows, ncols = 2048, 2048
tpb = 32

threads_per_block = (tpb, tpb)
blocks = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)

size = nrows*ncols

a = np.arange(size).reshape(nrows, ncols).astype(np.int32)
b = np.ones(shape=a.shape, dtype=np.int32)
out = np.empty_like(a).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.to_device(out)

uncoalesced_matrix_add[blocks, threads_per_block](d_a, d_b, d_out)
slow = d_out.copy_to_host()

coalesced_matrix_add[blocks, threads_per_block](d_a, d_b, d_out)
fast = d_out.copy_to_host()

np.array_equal(slow, fast)
# True

However, if I set nrows = 1024 so that the matrices are no longer square, coalesced_matrix_add() throws the following error:

CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR

What am I missing here?


Edit

For completeness, I'm attaching some timings. These were obtained using the workaround proposed by @Robert Crovella, with nrows = 1024 and ncols = 2048:

In [40]: %timeit uncoalesced_matrix_add[blocksu, threads_per_block](d_a, d_b, d_out)
289 µs ± 498 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [41]: %timeit coalesced_matrix_add[blocksc, threads_per_block](d_a, d_b, d_out)
164 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Solution

  • When you make your array non-square, this calculation is no longer correct for one of your two kernels:

    blocks = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)
    

    For both the thread and block dimensions, the x index dimension comes first, followed by the y index dimension (x and y here refer to the in-kernel built-in variables x and y).
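    As a quick sanity check of that ordering, the index mapping that `cuda.grid(2)` performs can be reproduced on the host with plain Python. The helper below is purely illustrative (it is not part of Numba's API): the first component of the result comes from the .x thread/block indices, the second from the .y indices.

    ```python
    # Host-side stand-in for cuda.grid(2): the first returned component
    # is built from the x (first-tuple-position) indices, the second
    # from the y indices.
    def grid2(block_idx, thread_idx, block_dim):
        bx, by = block_idx
        tx, ty = thread_idx
        x = bx * block_dim[0] + tx  # from blockIdx.x, threadIdx.x
        y = by * block_dim[1] + ty  # from blockIdx.y, threadIdx.y
        return x, y

    # Thread (3, 4) of block (1, 2), with a (32, 32) block shape:
    x, y = grid2((1, 2), (3, 4), (32, 32))
    assert (x, y) == (1 * 32 + 3, 2 * 32 + 4)  # (35, 68)
    ```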

    For the usage in your first kernel:

    out[x][y] = a[x][y] + b[x][y]
    

    we want x to index through the rows. That is consistent with your grid definition.

    For the usage in your second kernel:

    out[y][x] = a[y][x] + b[y][x]
    

    we want y to index through the rows. That is not consistent with your grid definition.

    The result is out-of-bounds access on the second kernel call. The orientation of your rectangular grid does not match the orientation of your rectangular data.

    In the square case, such reversal was immaterial, as both dimensions were the same.
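    Concretely, with nrows = 1024 and ncols = 2048, the original grid tuple launches 32 blocks along x and 64 along y; the coalesced kernel then uses y, which ranges up to 2048, as a row index into a 1024-row array. A few lines of host-side arithmetic (variable names here are just for illustration) make the overshoot visible:

    ```python
    nrows, ncols, tpb = 1024, 2048, 32

    # The question's grid: x sized from rows, y sized from columns
    blocks = ((nrows + tpb - 1) // tpb, (ncols + tpb - 1) // tpb)  # (32, 64)
    max_x = blocks[0] * tpb  # 1024: one past the largest x index
    max_y = blocks[1] * tpb  # 2048: one past the largest y index

    # The coalesced kernel indexes out[y][x], so y is a row index:
    assert max_y > nrows  # rows 1024..2047 are addressed but do not exist
    assert max_x < ncols  # and columns 1024..2047 are never written
    ```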

    Here is a possible "fix":

    $ cat t62.py
    from numba import cuda
    import numpy as np
    
    @cuda.jit
    def uncoalesced_matrix_add(a, b, out):
        x, y = cuda.grid(2)
        out[x][y] = a[x][y] + b[x][y]
    
    @cuda.jit
    def coalesced_matrix_add(a, b, out):
        x, y = cuda.grid(2)
        out[y][x] = a[y][x] + b[y][x]
    
    
    nrows, ncols = 512, 1024
    tpb = 32
    
    threads_per_block = (tpb, tpb)
    blocksu = ((nrows + (tpb - 1))//tpb, (ncols + (tpb - 1))//tpb)
    blocksc = ((ncols + (tpb - 1))//tpb, (nrows + (tpb - 1))//tpb)
    
    size = nrows*ncols
    
    a = np.arange(size).reshape(nrows, ncols).astype(np.int32)
    b = np.ones(shape=a.shape, dtype=np.int32)
    out = np.empty_like(a).astype(np.int32)
    
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_out = cuda.to_device(out)
    
    uncoalesced_matrix_add[blocksu, threads_per_block](d_a, d_b, d_out)
    slow = d_out.copy_to_host()
    
    coalesced_matrix_add[blocksc, threads_per_block](d_a, d_b, d_out)
    fast = d_out.copy_to_host()
    
    print(np.array_equal(slow, fast))
    # True
    $ python t62.py
    True
    $
    

    Also note that this grid sizing strategy only works as-is for dimensions that are whole-number divisible by the block size, because the kernels contain no bounds checks.
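    For shapes that are not multiples of the block size, the usual remedy is to keep the ceiling-division grid but add an explicit guard inside each kernel, e.g. `if y < out.shape[0] and x < out.shape[1]: out[y][x] = a[y][x] + b[y][x]`. Since the kernel itself needs a GPU to run, the sketch below simulates such a guarded launch in plain Python (a host-side stand-in under those assumptions, not Numba API) to show that the guard yields a correct result for a non-divisible shape:

    ```python
    import numpy as np

    def ceil_div(n, d):
        return (n + d - 1) // d

    def launch_guarded_add(a, b, out, tpb=32):
        # Simulate a 2D launch of a bounds-checked, coalesced kernel:
        # grid x spans the columns, grid y spans the rows, and the grid
        # is rounded up to whole blocks in each dimension.
        nrows, ncols = out.shape
        for y in range(ceil_div(nrows, tpb) * tpb):
            for x in range(ceil_div(ncols, tpb) * tpb):
                if y < nrows and x < ncols:  # the in-kernel guard
                    out[y, x] = a[y, x] + b[y, x]

    # 100 and 70 are deliberately not multiples of 32
    a = np.arange(100 * 70).reshape(100, 70)
    b = np.ones_like(a)
    out = np.empty_like(a)
    launch_guarded_add(a, b, out)
    assert np.array_equal(out, a + b)
    ```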