I know that xorq %rax,%rax is faster than movq $0,%rax because my compiler has told me. However, if I didn't know the answer, how could I compare the performance of xorq and movq?
What I have tried is this:
    int main(void)
    {
        long a;
        long i = 0;
        for (i = 0; i < 10000000000l; i++) {
            a = 10;
            /* #if 0 selects xorq; change it to #if 1 to time movq instead */
            __asm__(
    #if 0
                "movq $0, %%rax"
    #else
                "xorq %%rax, %%rax"
    #endif
                : "=a" (a) : "a" (a));
        }
        return 0;
    }
However, when I time the program (once with #if 0, once with #if 1), I keep getting very similar results (5.876±0.001 seconds). FYI, I have set the scaling governor to the lowest frequency and I have checked the user line returned by time(1).
I've also tried with addq %rax,%rax vs imulq $2,%rax, again with no luck.
I know that modern processors are pretty smart at optimizing code execution, and I guess this is why I'm not getting helpful results. So I'm here to ask: how should I proceed? Am I on the right path?
You're going to have to unroll the body of the loop many times, say 10 or 100. Otherwise you're mainly measuring the loop overhead. I would also suggest for (i = 1000...; --i>=0;) which might compile into fewer instructions.
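The unrolling suggested above could be sketched roughly as follows. This is only an illustration, assuming x86-64 with GCC or Clang and a POSIX clock_gettime; the names INSN, REP10, REP100, now_ns and time_unrolled are made up for this example and are not from the question. The instruction is repeated 100 times per iteration through string-literal concatenation, and the loop counts down as proposed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Instruction under test (hypothetical macro name). Override at build time
 * to compare against another instruction, e.g. movq $0, %rax. */
#ifndef INSN
#define INSN "xorq %%rax, %%rax\n\t"
#endif

/* Repeat a string literal 10 and 100 times; adjacent literals concatenate. */
#define REP10(x) x x x x x x x x x x
#define REP100(x) REP10(REP10(x))

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Executes INSN 100 * iterations times and returns the elapsed nanoseconds. */
uint64_t time_unrolled(long iterations)
{
    long i;
    uint64_t start;

    assert(iterations > 0);
    start = now_ns();
    for (i = iterations; --i >= 0;) {
        /* volatile keeps the compiler from deleting or hoisting the block;
         * the clobber list tells it that rax and the flags are overwritten. */
        __asm__ volatile (REP100(INSN) : : : "rax", "cc");
    }
    return now_ns() - start;
}
```

To use it, call time_unrolled from main with a large count (for example 100000000L) and print the result, then rebuild with something like -DINSN='"movq $0, %%rax\n\t"' and compare. Even with the loop overhead amortised this way, the two timings may stay close on CPUs that can retire several such simple instructions per cycle, so a small difference should be read with care.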