Does the OpenCL C compiler simplify math expressions?


I am making a fractal generator and it needs to be really fast. Currently, a line of OpenCL C is being generated based on a user formula:

// User inputs z^2 + c + z^2 for example, generating this line of code:
z = cpow(z, 2) + c + cpow(z, 2);

My question is, when this line is compiled into assembly, will it execute the cpow(z, 2) calculation twice to compute the expression, or is OpenCL C optimised to only do that calculation once, and reuse that result for when it comes across the second cpow(z, 2)?


Solution

  • General rule, for any programming language: Never trust the compiler to do any optimization for you.

    For certain simple things, you can be sure that the OpenCL compiler will optimize. Examples:

    • float x = y + (2.0f/3.0f + 4.0f); // the compiler will pre-compute arithmetic with literals, as long as it does not alter the order of operations, and in assembly you will get only a single addition. So use brackets!
    • if(x<y) x = 4; else x = 5; // the compiler will eliminate branching here and use the same assembly as for the ternary operator
    • float y = a*x+c; // the compiler will compress this into a single fused-multiply-add (FMA) instruction that does both the multiplication and the addition in one instruction
    • for(int i=0; i<8; i++) x = x%y; // the compiler will unroll the loop, so no clock cycles are wasted for incrementing i
    • float x = some complicated arithmetic; but then x is never used; // the compiler will delete x and all arithmetic that is used to compute its value
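
To make the first bullet concrete, here is a minimal sketch in plain C (the function names `with_parens` and `without_parens` are made up for illustration). C and OpenCL C both parse `+` from left to right, so without the brackets the literal 4.0f is added to a runtime value and cannot be merged with 2.0f/3.0f unless the compiler is allowed to reorder floating-point math.

```c
/* Grouping the literals lets the compiler fold them into one constant,
   leaving a single runtime addition. */
float with_parens(float y) {
    return y + (2.0f / 3.0f + 4.0f);
}

/* Parsed as (y + 2.0f/3.0f) + 4.0f: the division is still folded,
   but two runtime additions remain under strict floating-point rules. */
float without_parens(float y) {
    return y + 2.0f / 3.0f + 4.0f;
}
```

Both functions return nearly the same value, but the last bits can differ because the additions are rounded in a different order. Pasting both into https://godbolt.org/ should show the difference in the number of add instructions.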

    But there are still many pitfalls - small details, such as leaving out the brackets in the first example - that keep the compiler from optimizing to the full extent. You can experiment with https://godbolt.org/ to see what works and what doesn't. In OpenCL with Nvidia GPUs, you can generate PTX assembly and inspect it.

    Also the compiler is not too smart and does not always generate perfectly optimized assembly. In your example, the safe way for ideal performance - regardless of compiler settings - would be to just write it in an optimized manner:

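    // The pow function is way slower than just a multiplication. In OpenCL, the compiler
    // here will see a*b+c and compress that into an FMA instruction.
    // So 1 multiplication and 1 FMA for this line.
    // Note: this form assumes a scalar z. If z is a float2 holding a complex number,
    // z*z is a component-wise product, so a complex multiplication is needed instead.
    z = 2*z*z+c;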
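
If z is a complex number stored in a float2, the same idea can be sketched as below. This is plain C with a small struct standing in for float2, and `complex2`, `csquare` and `iterate` are hypothetical names, not part of OpenCL or of the question's code. It assumes x is the real part and y the imaginary part.

```c
/* Stand-in for OpenCL's float2: x holds the real part, y the imaginary part. */
typedef struct { float x, y; } complex2;

/* z*z for complex z = x + iy: (x*x - y*y) + i(2*x*y). */
static complex2 csquare(complex2 z) {
    complex2 r = { z.x * z.x - z.y * z.y, 2.0f * z.x * z.y };
    return r;
}

/* One iteration of z = z^2 + c + z^2, written as 2*z^2 + c. */
static complex2 iterate(complex2 z, complex2 c) {
    complex2 s = csquare(z);   /* the square is computed exactly once */
    complex2 r = { 2.0f * s.x + c.x, 2.0f * s.y + c.y };
    return r;
}
```

For z = 1 + 2i and c = 0.5 - i this gives z^2 = -3 + 4i and a result of -5.5 + 7i. The square costs three multiplications and one subtraction, which is far cheaper than a general complex power function.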
    

    A typical trick is also to use temporary variables for redundant terms in equations, and then just insert the variable wherever the term occurs.
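
A minimal sketch of the temporary-variable trick in plain C follows. The function `expensive` is a hypothetical stand-in for a costly call such as a user-defined cpow, and the call counter exists only to make the number of evaluations visible. Because of that counter the compiler is not allowed to merge the two calls in `naive`; with a pure, inlined function it might merge them, but the hoisted version does not depend on that.

```c
/* Hypothetical stand-in for an expensive function such as a user-defined cpow. */
static int expensive_calls = 0;

static float expensive(float z) {
    expensive_calls++;   /* side effect only to make the call count visible */
    return z * z;
}

/* The repeated term is evaluated twice. */
static float naive(float z, float c) {
    return expensive(z) + c + expensive(z);
}

/* The repeated term is evaluated once and reused through a temporary. */
static float hoisted(float z, float c) {
    const float t = expensive(z);
    return t + c + t;
}
```

A code generator for user formulas could apply the same idea by emitting one temporary per distinct subexpression before the final assignment.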

    Besides suboptimal performance if the compiler does not optimize properly, floating-point arithmetic can also give you suboptimal accuracy through larger round-off error, as that depends on which numbers you add and in which order. You should control and optimize this manually in the code; unless you enable relaxed-math build options such as -cl-fast-relaxed-math or -cl-unsafe-math-optimizations, the compiler then usually does not change the order of operations.
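
The effect of summation order can be seen in a small plain C sketch. The function names are made up for illustration, and the `volatile` temporaries are only there to force each intermediate result to be rounded to float, which is an assumption about how to make the demo behave the same on every host compiler.

```c
/* Computes (a + b) + c with the intermediate rounded to float. */
float sum_left(float a, float b, float c) {
    volatile float t = a + b;
    return t + c;
}

/* Computes a + (b + c) with the intermediate rounded to float. */
float sum_right(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, `sum_left` returns 1.0f, while `sum_right` returns 0.0f: neighbouring floats near 1e8 are 8 apart, so adding 1.0f to -1e8f is rounded away before the large terms cancel.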