Tags: optimization, cuda, pycuda, gpu-shared-memory

CUDA profiling - high shared transactions/access but low local replay rate


After running the Visual Profiler, guided analysis tells me that I'm memory-bound, and in particular that my shared memory accesses are poorly aligned/accessed - essentially every line that accesses shared memory is marked at ~2 transactions per access.

However, I couldn't figure out why that was the case (my shared memory is padded/strided so that there shouldn't be any bank conflicts), so I went back and checked the shared replay metric, which says that only 0.004% of shared accesses are replayed.

So, what's going on here, and what should I be looking at to speed up my kernel?

EDIT: Minimal reproduction:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gp

mod = SourceModule("""
typedef unsigned char ubyte;

__global__ void identity(ubyte *arr, int stride) 
{
    const int dim2 = 16;
    const int dim1 = 64;
    const int dim0 = 33;
    int shrstrd1 = dim2;
    int shrstrd0 = dim1 * dim2;
    __shared__ ubyte shrarr[dim0 * dim1 * dim2];

    auto shrget = [shrstrd0, shrstrd1, &shrarr](int i, int j, int k) -> int{ 
        return shrarr[i * shrstrd0 + j * shrstrd1 + k]; 
    };

    auto shrset = [shrstrd0, shrstrd1, &shrarr](int i, int j, int k, ubyte val) -> void {
        shrarr[i * shrstrd0 + j * shrstrd1 + k] = val;
    };

    int in_x = threadIdx.x;
    int in_y = threadIdx.y;

    shrset(in_y, in_x, 0, arr[in_y * stride + in_x]);
    arr[in_y * stride + in_x] = shrget(in_y, in_x, 0);
}
""",

(ditto)

options=['-std=c++11'])

# Equivalent to identity<<<1, dim3(32, 32, 1)>>>(arr, 64);
identity = mod.get_function("identity")
identity(gp.zeros((64, 64), np.ubyte), np.int32(64), block=(32, 32, 1))

This gives 2 transactions per access and a shared replay overhead of 0.083. Decreasing dim2 to 8 makes the problem go away, which I also don't understand.
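For reference, the same counters can be collected outside the Visual Profiler with nvprof. A hedged example, assuming the script above is saved as repro.py (metric names vary by compute capability; these exist on Kepler-era profilers):

nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request,shared_replay_overhead python repro.py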


Solution

  • Partial answer: I had a fundamental misunderstanding of how shared memory banks work. I had pictured each of the 32 banks as a contiguous chunk of roughly a thousand bytes, and so didn't realize that the banks wrap around: successive 4-byte words are assigned to successive banks, so the bank pattern repeats every 128 bytes. With too much padding, the 32 elements of a row can therefore land in the same bank more than once.

    Presumably, though, that conflict just didn't come up on every access - instead, from the numbers, it came up roughly 85 times per block.

    I'll leave this here for a day in hopes of a more complete explanation, then close and accept this answer.
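To make the wrap-around concrete, here is a minimal sketch of the bank arithmetic, assuming the usual model of 32 banks, each 4 bytes wide, with successive 4-byte words assigned to banks round-robin (so the bank pattern repeats every 128 bytes; exact transaction counts depend on the architecture, and e.g. Kepler can also be configured for 8-byte-wide banks):

from collections import Counter

def bank_hits(dim2, banks=32, bank_width=4):
    # Within one warp of this kernel, threadIdx.y is fixed and
    # threadIdx.x = j runs 0..31, so the byte offsets into shrarr
    # are j * dim2 (the k term is always 0 here).
    offsets = [j * dim2 for j in range(32)]
    words = [off // bank_width for off in offsets]  # which bank-width word
    return Counter(w % banks for w in words)        # which bank it lives in

print(bank_hits(16))  # 8 distinct banks, each hit by 4 threads at
                      # different words -> bank conflicts
print(bank_hits(8))   # 16 banks, each hit by 2 threads -> fewer conflicts;
                      # with bank_hits(8, bank_width=8) every thread
                      # gets its own bank

With dim2 = 16, a warp's 32 row accesses span 32 * 16 = 512 bytes, i.e. they wrap the 128-byte bank cycle four times, so each bank that is used gets hit four times at four different words. Shrinking the stride to dim2 = 8 halves the wrap count, which (depending on the bank width of the device) reduces or eliminates the conflicts.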