performance, opengl, gpu, glsl, intrinsics

"Intrinsics" possible on GPU on OpenGL?


I had an idea for something "intrinsic-like" in OpenGL, but googling around turned up no results.

So basically I have a compute shader that calculates the Mandelbrot set (each thread does one pixel). Part of my main function in GLSL looks like this:

float XR, XI, XR2, XI2, CR, CI;
uint i;
// map this invocation's pixel to a point C in the complex plane
CR = float(minX + gl_GlobalInvocationID.x * (maxX - minX) / ResX);
CI = float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
XR = 0;
XI = 0;
for (i = 0; i < MaxIter; i++)
{
    // Z = Z * Z + C
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2 * XR * XI + CI;
    XR = XR2 - XI2 + CR;
    // stop once |Z| exceeds 2 (|Z|^2 > 4), i.e. the point has escaped
    if ((XR * XR + XI * XI) > 4.0)
    {
        break;
    }
}

My thought was to use vec4 instead of float and thereby do 4 calculations/pixels at once, hoping for roughly a 4x speed-up (analogous to "real" CPU intrinsics). But my code seems to run MUCH slower than the float version. There are still some mistakes in it (if anyone would like to see the code, please say so), but I don't think they are what slows it down. Before I spend ages experimenting, can anybody tell me right away whether this endeavour is futile?
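For illustration, here is a minimal sketch of what the vec4 variant might look like, assuming the min/max/Res uniforms are floats and each invocation handles four horizontally adjacent pixels. This is a sketch of the idea, not the asker's actual code:

// Sketch only: 4 pixels per invocation, one per vec4 lane (assumed layout).
vec4 XR, XI, XR2, XI2, CR, CI;
vec4 iter = vec4(0.0);   // per-lane iteration count
uint i;

// four consecutive x coordinates handled by this invocation
vec4 px = vec4(gl_GlobalInvocationID.x * 4u) + vec4(0.0, 1.0, 2.0, 3.0);
CR = vec4(minX) + px * (maxX - minX) / ResX;
CI = vec4(minY + float(gl_GlobalInvocationID.y) * (maxY - minY) / ResY);
XR = vec4(0.0);
XI = vec4(0.0);

for (i = 0u; i < MaxIter; i++)
{
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2.0 * XR * XI + CI;
    XR = XR2 - XI2 + CR;

    bvec4 escaped = greaterThan(XR * XR + XI * XI, vec4(4.0));
    iter += vec4(not(escaped));   // lanes still inside keep counting
    if (all(escaped))             // can only stop once all four lanes have escaped
    {
        break;
    }
}

Note that the loop can only terminate once all four lanes have escaped, and already-escaped lanes keep doing wasted work in the meantime, which already hints at why this approach may not pay off.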


Solution

  • CPUs and GPUs work quite differently.

    CPUs need explicit vectorization in the machine code, either coded manually by the programmer (through what you call 'CPU intrinsics') or applied automatically by the compiler.

    GPUs, on the other hand, vectorize by means of running multiple invocations of your shader (aka kernel) on their cores in parallel.

    AFAIK, on modern GPUs, additional vectorization within a thread is neither needed nor supported: instead of manufacturing a single core that can add 4 floats per clock (for example), it's more beneficial to have four times as many simpler cores, each able to add a single float per clock. This way you still get the same peak FLOPS for the entire chip, while also enabling full utilization of the circuitry even when the individual shader code cannot be vectorized. The thing is that most code will, out of necessity, have at least some scalar computations in it.

    The bottom line: it's likely that your code, with one invocation per pixel, already gets as much out of the GPU as possible for this specific task (see the sketch below for the overall structure of that approach).
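For reference, here is a minimal sketch of the one-invocation-per-pixel structure described above, written out as a complete compute shader. The workgroup size, image binding, and uniform declarations are assumptions for illustration, not the asker's actual setup:

#version 430
layout(local_size_x = 16, local_size_y = 16) in;   // 256 invocations per workgroup

layout(rgba8, binding = 0) writeonly uniform image2D outImage;

uniform float minX, maxX, minY, maxY, ResX, ResY;
uniform uint MaxIter;

void main()
{
    // one pixel per invocation; the GPU runs many of these in parallel
    float CR = minX + float(gl_GlobalInvocationID.x) * (maxX - minX) / ResX;
    float CI = minY + float(gl_GlobalInvocationID.y) * (maxY - minY) / ResY;

    float XR = 0.0;
    float XI = 0.0;
    uint i;
    for (i = 0u; i < MaxIter; i++)
    {
        float XR2 = XR * XR;
        float XI2 = XI * XI;
        XI = 2.0 * XR * XI + CI;
        XR = XR2 - XI2 + CR;
        if (XR * XR + XI * XI > 4.0)
        {
            break;
        }
    }

    // map the iteration count to a grey value and write the pixel
    float c = float(i) / float(MaxIter);
    imageStore(outImage, ivec2(gl_GlobalInvocationID.xy), vec4(c, c, c, 1.0));
}

The parallelism the answer describes then comes from the dispatch on the host side, e.g. glDispatchCompute with enough 16x16 workgroups to cover the whole image.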