Tags: vectorization, opencl, gpgpu, opencl-c

Why does vectorization of this simple OpenCL kernel make it slower?


I am currently implementing a naive parallel prefix sum on the GPU using OpenCL. This is part of a project that needs continuous stream compaction on large arrays, so I thought it would be a good idea to first familiarize myself with the algorithm and then try to build one of the more advanced versions (which, from my limited understanding, all derive from Blelloch's algorithm).

Currently, I have a kernel called step_naive_prefix_sum, which just adds a to a copy of itself shifted by a certain offset and saves the result into b.

kernel void step_naive_prefix_sum(global int* a, global int* b, int offset, int nels)
{
    int i = get_global_id(0);
    if (i >= nels) return;
    // one naive-scan step: b[i] = a[i] + a[i - offset] (or just a[i] for i < offset)
    b[i] = a[i];
    if (i >= offset) {
        b[i] += a[i - offset];
    }
}

The kernel works, and so does the rest of the algorithm built around it.

On a very small test load I got these results (computed as the sum of the kernel times reported by clGetEventProfilingInfo after each invocation):

prefix_sum: 68.112549 ms
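
For context, the time is accumulated per kernel launch roughly like this (a simplified sketch, not my exact code: the queue is created with CL_QUEUE_PROFILING_ENABLE, and the variable names are illustrative):

cl_event evt;
cl_ulong t_start, t_end;
double prefix_sum_ms = 0.0;

clEnqueueNDRangeKernel(queue, step_kernel, 1, NULL, &gws, &lws, 0, NULL, &evt);
clWaitForEvents(1, &evt);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
prefix_sum_ms += (t_end - t_start) * 1e-6;  /* nanoseconds -> milliseconds */
clReleaseEvent(evt);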

I thought that this kind of task would be perfectly suited for vectorization using OpenCL vector types, so I swiftly coded the variants:

kernel void step_naive_prefix_sum_vectorized2(global int2* a, global int2* b, int offset, int ncouples)
{
    int i = get_global_id(0);
    if (i >= ncouples) return;
    b[i] = a[i];
    if (i >= offset) {
        b[i] += a[i - offset];
    }
}

kernel void step_naive_prefix_sum_vectorized4(global int4* a, global int4* b, int offset, int nquarts)
{
    int i = get_global_id(0);
    if (i >= nquarts) return;
    b[i] = a[i];
    if (i >= offset) {
        b[i] += a[i - offset];
    }
}

And I modified the loop that calls the kernels so that it handles the cases offset = 1, offset = 2, and offset >= 4 as follows (see the sketch after this list):

  • offset = 1: call the normal (scalar) kernel
  • offset = 2: call the int2 kernel, with the global size halved (still rounded up to a multiple of the local work-group size) and passing ncouples = nels/2
  • offset >= 4: call the int4 kernel, with the global size divided by 4 (still rounded up to a multiple of the local work-group size) and passing nquarts = nels/4
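
The dispatch loop looks roughly like this (a simplified sketch, not my exact code: the kernel and buffer names are placeholders, queue, lws and the buffers are assumed to be set up elsewhere, and scaling offset to int2/int4 elements is my reconstruction of the description above):

for (int offset = 1; offset < nels; offset *= 2) {
    cl_kernel k;
    size_t nwork;   /* work-items before rounding */
    int koffset;    /* offset expressed in the kernel's element type */
    if (offset == 1)      { k = k_scalar; nwork = nels;     koffset = offset;     }
    else if (offset == 2) { k = k_int2;   nwork = nels / 2; koffset = offset / 2; }
    else                  { k = k_int4;   nwork = nels / 4; koffset = offset / 4; }

    int nelem_arg = (int)nwork;  /* nels / ncouples / nquarts argument */
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(k, 2, sizeof(int), &koffset);
    clSetKernelArg(k, 3, sizeof(int), &nelem_arg);

    /* round the global size up to a multiple of the work-group size */
    size_t gws = ((nwork + lws - 1) / lws) * lws;
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &gws, &lws, 0, NULL, NULL);

    /* swap input and output buffers for the next step */
    cl_mem tmp = buf_a; buf_a = buf_b; buf_b = tmp;
}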

Note that, by design, I pad the array size to a power of 2 (well above 4), so there is no need to check for edge cases at the end of the array.
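
The padded size is simply the next power of two, computed with a small helper along these lines (illustrative, not necessarily how my code does it):

static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;  /* smallest power of two >= n */
    return p;
}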

Yet, the time becomes:

prefix_sum: 79.486565 ms

which surprises me. The results remain consistent across some additional test loads.

Can someone help me understand what went wrong, and maybe point me in the right direction? Of course, any other tip on how to make this (or similar code) run faster is welcome!


Solution

  • The vector types break coalesced memory access.

    Your kernel is bandwidth-bound. int2 and int4 are structs holding 2 or 4 integers, so consecutive work-items no longer access consecutive int locations but locations 2 or 4 ints apart (see the illustration kernel at the end of this answer).

    On old GPU architectures, strided memory access was a deal-breaker for performance. Modern architectures have built-in mechanisms to compensate, but it still negatively impacts memory bandwidth.

    If your kernel were not bandwidth-bound (i.e. it did a lot more arithmetic), the vectorization would still not make it any faster, as GPU ALUs are not capable of SIMD, apart from exceptions such as half2 arithmetic since Nvidia Turing and AMD Polaris, and float2 dual-issuing on RDNA3.

    Such vectorization (and Structure-of-Arrays memory layout) is only beneficial on CPUs, because they do SIMD.
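
    To make the stride concrete, here is a small illustration kernel (hypothetical, not one of the kernels from the question) contrasting the two access patterns; vload4 reads 4 consecutive ints starting at a + 4*i:

    kernel void access_pattern_demo(global const int* a, global int* out)
    {
        int i = get_global_id(0);

        // Scalar variant: work-item i reads a[i], so neighbouring work-items
        // read neighbouring ints (4-byte stride between consecutive threads).
        int s = a[i];

        // int4-style variant: work-item i reads ints 4*i .. 4*i+3, so
        // neighbouring work-items are now 16 bytes apart.
        int4 v = vload4(i, a);

        out[i] = s + v.x;  // keep both loads live
    }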