Search code examples
cgccloop-unrolling

Force/Convince/Trick GCC into Unrolling _Longer_ Loops?


How do I convince GCC to unroll a loop where the number of iterations is known, but large?

I'm compiling with -O3.

The real code in question is more complex, of course, but here's a boiled-down example that has the same behavior:

int const constants[] = { 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144 };

int get_sum_1()
{
    int total = 0;
    for (int i = 0; i < CONSTANT_COUNT; ++i)
    {
        total += constants[i];
    }
    return total;
}

...if CONSTANT_COUNT is defined as 8 (or less) then GCC will unroll the loop, propagate the constants, and reduce the entire function down to a simple return <value>;. If, on the other hand, CONSTANT_COUNT is 9 (or greater) then the loop is not unrolled, and GCC produces a binary which loops, reads the constants, and adds them at run-time - even though, in theory, the function could still be optimized down to just returning a constant. (Yes, I've looked at the decompiled the binary.)

If I manually unroll the loop, like this:

int get_sum_2()
{
    int total = 0;
    total += constants[0];
    total += constants[1];
    total += constants[2];
    total += constants[3];
    total += constants[4];
    total += constants[5];
    total += constants[6];
    total += constants[7];
    total += constants[8];
    //total += constants[9];
    return total;
}

Or this:

#define ADD_CONSTANT(z, v, c) total += constants[v];

int get_sum_2()
{
    int total = 0;
    BOOST_PP_REPEAT(CONSTANT_COUNT, ADD_CONSTANT, _)
    return total;
}

...then the function is optimized down to returning a constant. So, GCC appears to be able to handle the constant propagation for larger loops, once unrolled; the hang-up appears to be just getting GCC to consider unrolling the longer loop in the first place.

However, neither hand-unrolling nor BOOST_PP_REPEAT are viable options, because there are some cases where CONSTANT_COUNT is a run-time expression, and the same code still needs to work correctly for those cases. (Performance isn't as critical in those cases.)

I'm working in C (not C++) so neither template meta-programming nor constexpr are available to me.

I've tried -funroll-loops, -funroll-all-loops, -fpeel-loops, and setting large values for max-unrolled-insns, max-average-unrolled-insns, max-unroll-times, max-peeled-insns, max-peel-times, max-completely-peeled-insns, and max-completely-peel-times, none of which seem to make a difference.

I'm using GCC 4.8.2, on Linux, x86_64.

Any ideas? Is there a flag or parameter I'm missing...?


Solution

  • I'm not sure if this workaround is applicable to your actual problem but I found that GCC 4.9.0 20140604 (prerelease) on an x86_64 running Parabola GNU/Linux unrolls the following loop up to and including CONSTANT_COUNT == 33.

    int
    get_sum()
    {
      int total = 0;
      int i, j, k = 0;
      for (j = 0; j < 2; ++j)
        {
          for (i = 0; i < CONSTANT_COUNT / 2; ++i)
            {
              total += constants[k++];
            }
        }
      if (CONSTANT_COUNT % 2)
        total += constants[k];
      return total;
    }
    

    I have only passed it the -O3 flag. The assembly code for get_sum is really just

    movl $561, %eax
    ret
    

    I did not try but maybe the pattern can be extended even further.

    It seems odd to me that this works since – at least to my human eyes – the code looks much more complex now. Unfortunately, it is a rather intrusive way to force the unrolling. A compiler flag would have been much nicer.