
Memory Accesses Make a CUDA Kernel extremely slow


I'm trying to use CUDA to make a basic fragment shader, and I have found that actually executing the kernel takes over a second, which is unacceptable for a shader I'm trying to run in real time. Using the synchronize method and commenting out parts of the kernel, I found that the memory accesses to the output array are what make it so slow. I haven't really tried anything to solve the problem because I can't even fathom where to start. This is in PyCUDA, which I don't think really matters, but here's the kernel code:

__global__ void fragment_shader(int palette_lim,float *palette, float *input, float *output) {
    int fragment_idx = (3*gridDim.y*blockIdx.x)+(3*blockIdx.y);
    float min_dist = sqrtf(3);
    float color_dist;
    int best_c = 0;
    for (int c=0;c<palette_lim;c++) {
        color_dist = sqrtf(pow(input[fragment_idx]-palette[c*3],2)+pow(input[fragment_idx+1]-palette[c*3+1],2)+pow(input[fragment_idx+2]-palette[c*3+2],2));
        if (color_dist < min_dist) {
            min_dist = color_dist;
            best_c = c;
        }
    }

    //These are the lines that make it slow. If these lines get commented out, it runs in a time that would be acceptable
    output[fragment_idx] = palette[best_c*3];
    output[fragment_idx+1] = palette[best_c*3+1];
    output[fragment_idx+2] = palette[best_c*3+2];
}

EDIT: After playing around with it a bit more, I found that it also has to do with what's being assigned to the output array: when I had it write some constants rather than values from the palette, it also ran just fine; it just didn't do anything useful then.


Solution

  • First some remarks on your actual computation:

    • You compare sqrtf(x) < sqrtf(3). Roots are expensive. Just compare x < 3.f.
    • Even if you want to keep the square root to avoid overflowing the float range (probably not a concern), don't use sqrt(pow(x, 2)+...); in fact, don't use pow just for squaring. Use hypotf for 2D or norm3df for 3D vectors (see the snippet after these remarks).
    • If no palette color is closer than sqrtf(3), best_c stays at 0, so you silently fall back to the first palette entry. It very much looks like you simply want the best-fitting color.
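
    For example, the distance computation in the loop body could be written without pow; norm3df(x, y, z) computes sqrtf(x*x + y*y + z*z) in a single call:

    
    // same computation as the original loop body, without sqrtf(pow(...))
    color_dist = norm3df(input[fragment_idx]     - palette[c*3],
                         input[fragment_idx + 1] - palette[c*3 + 1],
                         input[fragment_idx + 2] - palette[c*3 + 2]);
    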

    Now let's analyze your memory accesses:

    fragment index

    Let's look at fragment_idx = 3*gridDim.y*blockIdx.x+3*blockIdx.y: You're not taking threadIdx.x and threadIdx.y into account. This is your main problem: Many threads act on the same input and output data. You likely want this: fragment_idx = 3 * (threadIdx.y * blockDim.x + threadIdx.x)
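
    If you end up launching more than one block, the usual pattern also folds the block index in. A hedged sketch (width and height are assumed kernel parameters giving the image size, not something from the original post):

    
    // Hedged sketch: common 2D global-index pattern for a multi-block launch.
    // 'width' and 'height' are assumed kernel parameters (image size in pixels).
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;      // guard against partial blocks
    int fragment_idx = 3 * (y * width + x);     // 3 floats (RGB) per fragment
    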

    input

    So you load 3 floats. For starters, why do you reload it inside the loop when it isn't dependent on the loop iteration? I assume the compiler saves you from that access but don't get in the habit of doing that.

    Second, your access pattern isn't properly coalesced since a) these are 3 independent accesses and b) CUDA cannot coalesce accesses to float3 vectors even if you did it properly. Please read section 9.2.1 Coalesced Access to Global Memory of the Best Practices Guide. For better performance you have two options:

    1. You add 1 float per fragment_idx so you can load the whole thing as a float4
    2. You transpose your input array from an Nx3 matrix into a 3xN matrix (a planar layout; see the sketch after this list)
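
    As a rough illustration of option 2, assuming the host code uploads the data in planar order (all R values first, then all G, then all B) and passes the number of fragments as a parameter:

    
    // Hedged sketch of option 2: planar (3xN) layout, so consecutive threads read
    // consecutive addresses within each plane (coalesced).
    // 'n_fragments' is an assumed kernel parameter.
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per fragment
    float r = input[i];                              // R plane
    float g = input[n_fragments + i];                // G plane
    float b = input[2 * n_fragments + i];            // B plane
    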

    palette

    Same problem with the access of 3 floats. Plus, every thread now reads the same values, since c doesn't depend on the thread index. At the very least, the access should go through the __ldg function to use the read-only data cache. Preferably you prefetch the palette into shared memory.

    output

    The write access has the same issue with uncoalesced access. Plus, since best_c varies among threads, the read accesses to palette are effectively random. You already load the palette values inside your loop anyway; just save the best palette color in a local variable and reuse it to store the output at the end.

    Methodology

    Two remarks:

    1. Try to make your code valid before making it fast. That would have caught the fragment_idx bug.
    2. If you simplify the code, e.g. by removing the output writes, the compiler will happily optimize most of your code away. That's not how you do a proper performance assessment. Use a profiler; CUDA comes with very good ones.

    Minimal fix

    This is the simplest code that rectifies the issues above. It doesn't solve the problem of loading vector3 variables and it doesn't use shared memory; that requires more involved changes.

    
    __device__ float sqr_norm(float3 a, float3 b) {
        a.x -= b.x, a.y -= b.y, a.z -= b.z;
        return a.x * a.x + a.y * a.y + a.z * a.z;
    }
    __global__ void fragment_shader(int palette_lim,
              const float *palette, const float *input,
              float *output) {
        int fragment_idx = 3 * (threadIdx.y * blockDim.x + threadIdx.x);
        /* TODO: Switch to float4 for better memory access patterns */
        float3 inputcolor = make_float3(
              input[fragment_idx], input[fragment_idx + 1], input[fragment_idx + 2]);
        float min_dist_sqr = 3.f;
        /* The old code always used color index 0 if there was no fit */
        float3 best_color = make_float3(
              __ldg(palette), __ldg(palette + 1), __ldg(palette + 2));
        float best_dist = sqr_norm(best_color, inputcolor);
        for(int c = 1; c < palette_lim; c++) {
            /* TODO: Prefetch into shared memory */
            float3 color = make_float3(   /* palette stores 3 floats per color */
                  __ldg(palette + c * 3), __ldg(palette + c * 3 + 1), __ldg(palette + c * 3 + 2));
            float dist = sqr_norm(color, inputcolor);
            /* The old code fell back to color 0 when nothing was within the limit,
             * so the min_dist check only preserves that behavior */
            if(dist < min_dist_sqr && dist < best_dist) {
                best_color = color;
                best_dist = dist;
            }
        }
        output[fragment_idx] = best_color.x;
        output[fragment_idx + 1] = best_color.y;
        output[fragment_idx + 2] = best_color.z;
    }
    

    Extensive fix

    Here is a more extensive rewrite:

    1. All arrays are changed to float4 (RGBA instead of RGB). The extra channel is ignored for the distance computation but it is propagated. Typically one tries to use the value for something, e.g. you could store the distance in it.
    2. Shared memory is used to buffer the color palette. The dynamic shared memory requirement is outlined in the code comments.

    
    __device__ float sqr_dist_rgb(float4 a, float4 b) {
        a.x -= b.x, a.y -= b.y, a.z -= b.z;
        return a.x * a.x + a.y * a.y + a.z * a.z;
    }
    __global__ void fragment_shader(int palette_lim,
              const float4 *palette, const float4 *input,
              float4 *output) {
        /* Call with dynamic shared memory:
         * 2 * sizeof(float4) * blockDim.x * blockDim.y */
        extern __shared__ float4 colorbuf[];
        const int buf_size = blockDim.x * blockDim.y;
        const int buf_idx = threadIdx.y * blockDim.x + threadIdx.x;
        const int fragment_idx = threadIdx.y * blockDim.x + threadIdx.x;
        const float4 inputcolor = input[fragment_idx];
        float4 best_color =  __ldg(palette);
        const float min_dist_sqr = 3.f;
        float best_dist = sqr_dist_rgb(best_color, inputcolor);
        for(int cb = 0, b = 0; cb < palette_lim; b ^= 1, cb += buf_size) {
            /* We use a double buffer scheme to reduce the __syncthreads calls */
            float4* cur_buf = b ? colorbuf + buf_size : colorbuf;
            if(cb + buf_idx < palette_lim)
                cur_buf[buf_idx] = __ldg(palette + cb + buf_idx);
            __syncthreads();
            const int n = min(buf_size, palette_lim - cb);
            for(int c = 0; c < n; c++) {
                float4 color = cur_buf[c];
                float dist = sqr_dist_rgb(color, inputcolor);
                if(dist < min_dist_sqr && dist < best_dist) {
                    best_color = color;
                    best_dist = dist;
                }
            }
        }
        output[fragment_idx] = best_color;
    }
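
    /* For completeness: a hedged sketch of how this kernel might be launched from
     * CUDA C++ host code. The block shape is an assumption; the important part is
     * the third launch parameter carrying the dynamic shared memory size.
     * (In PyCUDA, the same byte count goes into the 'shared' argument of the call.)
     * d_palette, d_input and d_output are assumed pre-allocated device pointers. */
    void launch_fragment_shader(int palette_lim, const float4 *d_palette,
              const float4 *d_input, float4 *d_output) {
        dim3 block(16, 16);   // assumed block shape
        dim3 grid(1, 1);      // the kernel indexes fragments within a single block
        size_t shmem = 2 * sizeof(float4) * block.x * block.y;   // double buffer
        fragment_shader<<<grid, block, shmem>>>(palette_lim, d_palette, d_input, d_output);
    }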
    

    Algorithmic improvements

    For larger palettes, a brute-force search like this is suboptimal. Spatial index structures can do the same job faster. The classic structure for this would be a KD-tree. If you search for it, you will find some CUDA implementations that might be worth checking out.