Tags: performance, assembly, x86, x86-64, micro-optimization

Is movzbl followed by testl faster than testb?


Consider this C code:

int f(void) {
    int ret;
    char carry;

    __asm__(
        "nop # do something that sets eax and CF"
        : "=a"(ret), "=@ccc"(carry)
    );

    return carry ? -ret : ret;
}

When I compile it with gcc -O3, I get this:

f:
        nop # do something that sets eax and CF
        setc    %cl
        movl    %eax, %edx
        negl    %edx
        testb   %cl, %cl
        cmovne  %edx, %eax
        ret

If I change char carry to int carry, I instead get this:

f:
        nop # do something that sets eax and CF
        setc    %cl
        movl    %eax, %edx
        movzbl  %cl, %ecx
        negl    %edx
        testl   %ecx, %ecx
        cmovne  %edx, %eax
        ret

That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The two versions are equivalent, though, and GCC knows it. As evidence, if I compile with -Os instead of -O3, then both char carry and int carry produce exactly the same assembly:

f:
        nop # do something that sets eax and CF
        jnc     .L1
        negl    %eax
.L1:
        ret

It seems like one of two things must be true, but I'm not sure which:

  1. A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
  2. A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.

My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.

By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.


Solution

  • There's no downside I'm aware of to reading a byte register with test vs. movzb.

    If you are going to zero-extend, it's also a missed optimization not to xor-zero a register ahead of the asm statement and setc into that, so the cost of the zero-extension is off the critical path. (This matters on CPUs other than Intel Ivy Bridge and later, where movzx r32, r8 has zero latency.) Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.

    Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.