performance assembly latency instructions

How do I compare ASM instruction speeds?

I know that xorq %rax,%rax is faster than movq $0,%rax because my compiler has told me. However, if I didn't know the answer, how could I compare the performance of xorq and movq myself?

What I have tried is this:

int main(void)
{
  long a;
  long i = 0;
  for (i = 0; i < 10000000000l; i++) {
    a = 10;
    /* volatile keeps the compiler from dropping the asm, since a is never read */
    __asm__ volatile(
#if 0
            "movq $0, %%rax"
#else
            "xorq %%rax, %%rax"
#endif
            : "=a" (a) : "a" (a) : "cc");
  }
  return 0;
}

However, when I time the program (once with #if 0, once with #if 1), I keep getting very similar results (5.876±0.001 seconds). FYI, I have set the scaling governor to the lowest frequency, and I am reading the user line reported by time(1).

I've also tried with addq %rax,%rax vs imulq $2,%rax, again with no luck.

I know that modern processors are pretty smart at optimizing code execution, and I guess this is why I'm not getting helpful results. So I'm here to ask: how should I proceed? Am I on the right path?

Solution

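You're going to have to unroll the body of the loop many times, say 10 or 100 copies of the instruction per iteration. Otherwise you're mainly measuring the loop overhead (the increment, compare and branch) rather than the instruction itself. I would also suggest counting down with for (i = 1000...; --i>=0;), which might compile into fewer instructions.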
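
Here is a rough sketch of what that unrolling could look like, assuming GCC or Clang on x86-64. The names REP10, REP100, run_xor, run_mov and time_it are made up for this example. The macros just paste the instruction string 100 times, so each trip through the loop executes 100 copies of the instruction for one decrement and branch.

```c
#include <time.h>

/* Repeat a string literal 10 or 100 times; adjacent literals are concatenated. */
#define REP10(s) s s s s s s s s s s
#define REP100(s) REP10(REP10(s))

/* 100 copies of xorq per iteration, counting down as suggested above.
   Assumes x86-64 and GCC/Clang extended asm. */
static long run_xor(long iters)
{
    long a = 10;
    long i;
    for (i = iters; --i >= 0;) {
        /* "+a" pins a to rax; xor modifies the flags, hence the "cc" clobber */
        __asm__ volatile(REP100("xorq %%rax, %%rax\n\t")
                         : "+a" (a) : : "cc");
    }
    return a;
}

/* Same loop with 100 copies of movq $0. */
static long run_mov(long iters)
{
    long a = 10;
    long i;
    for (i = iters; --i >= 0;) {
        __asm__ volatile(REP100("movq $0, %%rax\n\t")
                         : "+a" (a));
    }
    return a;
}

/* CPU time in seconds spent in fn(iters), measured with clock(). */
static double time_it(long (*fn)(long), long iters)
{
    clock_t start = clock();
    fn(iters);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

You would call time_it(run_xor, n) and time_it(run_mov, n) from main with the same n and compare the two numbers. Even then the timings may come out close on some CPUs, since both instructions are cheap; in that case the difference may lie somewhere other than raw speed, such as code size.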