How do I convince GCC to unroll a loop where the number of iterations is known, but large?
I'm compiling with -O3
.
The real code in question is more complex, of course, but here's a boiled-down example that has the same behavior:
int const constants[] = { 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144 };
int get_sum_1()
{
int total = 0;
for (int i = 0; i < CONSTANT_COUNT; ++i)
{
total += constants[i];
}
return total;
}
...if CONSTANT_COUNT
is defined as 8 (or less) then GCC will unroll the loop, propagate the constants, and reduce the entire function down to a simple return <value>;
. If, on the other hand, CONSTANT_COUNT
is 9 (or greater) then the loop is not unrolled, and GCC produces a binary which loops, reads the constants, and adds them at run-time - even though, in theory, the function could still be optimized down to just returning a constant. (Yes, I've looked at the decompiled the binary.)
If I manually unroll the loop, like this:
int get_sum_2()
{
int total = 0;
total += constants[0];
total += constants[1];
total += constants[2];
total += constants[3];
total += constants[4];
total += constants[5];
total += constants[6];
total += constants[7];
total += constants[8];
//total += constants[9];
return total;
}
Or this:
#define ADD_CONSTANT(z, v, c) total += constants[v];
int get_sum_2()
{
int total = 0;
BOOST_PP_REPEAT(CONSTANT_COUNT, ADD_CONSTANT, _)
return total;
}
...then the function is optimized down to returning a constant. So, GCC appears to be able to handle the constant propagation for larger loops, once unrolled; the hang-up appears to be just getting GCC to consider unrolling the longer loop in the first place.
However, neither hand-unrolling nor BOOST_PP_REPEAT
are viable options, because there are some cases where CONSTANT_COUNT
is a run-time expression, and the same code still needs to work correctly for those cases. (Performance isn't as critical in those cases.)
I'm working in C (not C++) so neither template meta-programming nor constexpr
are available to me.
I've tried -funroll-loops
, -funroll-all-loops
, -fpeel-loops
, and setting large values for max-unrolled-insns
, max-average-unrolled-insns
, max-unroll-times
, max-peeled-insns
, max-peel-times
, max-completely-peeled-insns
, and max-completely-peel-times
, none of which seem to make a difference.
I'm using GCC 4.8.2, on Linux, x86_64.
Any ideas? Is there a flag or parameter I'm missing...?
I'm not sure if this workaround is applicable to your actual problem but I found that GCC 4.9.0 20140604 (prerelease) on an x86_64 running Parabola GNU/Linux unrolls the following loop up to and including CONSTANT_COUNT == 33
.
int
get_sum()
{
int total = 0;
int i, j, k = 0;
for (j = 0; j < 2; ++j)
{
for (i = 0; i < CONSTANT_COUNT / 2; ++i)
{
total += constants[k++];
}
}
if (CONSTANT_COUNT % 2)
total += constants[k];
return total;
}
I have only passed it the -O3
flag. The assembly code for get_sum
is really just
movl $561, %eax
ret
I did not try but maybe the pattern can be extended even further.
It seems odd to me that this works since – at least to my human eyes – the code looks much more complex now. Unfortunately, it is a rather intrusive way to force the unrolling. A compiler flag would have been much nicer.