Tags: assembly, x86-64, cpu-architecture, micro-optimization, branch-prediction

Is CMOVcc considered a branching instruction?


I have this memchr code that I'm trying to make non-branching:

.globl memchr
memchr:                         # memchr(ptr = %rdi, byte = %sil, count = %rdx)
        mov %rdx, %rcx          # rcx = byte count for repne scasb
        mov %sil, %al           # al = the byte to scan for
        cld                     # scan forward
        repne scasb             # advance rdi until a match or rcx == 0
        lea -1(%rdi), %rax      # rax = address of the last byte examined
        test %rcx, %rcx         # count exhausted?
        cmove %rcx, %rax        # if so, return NULL (rcx is 0 here)
        ret

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?


Solution

  • No, it's not a branch; that's the whole point of cmovcc.

    It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)
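    For example, here's a tiny hypothetical fragment (not from the question's code) showing why cmov's memory form can't make a load conditional:

            test   %rdi, %rdi          # ZF=1 if the pointer is NULL
            cmovne (%rdi), %eax        # the load from (%rdi) happens
                                       # regardless of the condition,
                                       # so this still faults on NULL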

    But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (The only difference from adc/sbb is that it doesn't write FLAGS, and of course it runs on an execution unit with different internals.)

    Whether that's good or bad depends entirely on the use case. See also the Q&A "gcc optimization flag -O3 makes code slower than -O2" for much more about cmov's upsides and downsides.
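    As a concrete illustration of cmov as a data-dependent select, here is a minimal branchless max (a standalone hypothetical example, unrelated to the memchr code above):

    # int max(int a, int b)   -- a in %edi, b in %esi (SysV ABI)
    max:
            mov    %edi, %eax          # result = a
            cmp    %esi, %edi          # compare a against b
            cmovl  %esi, %eax          # if a < b, result = b
            ret

    The cmovl always executes; it just selects which value lands in %eax, so there is nothing for the branch predictor to guess at and nothing to mispredict.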


    Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.

    repne scasb runs at about 1 byte per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb/pmovmskb/test+jnz loop (sketched below). And with clever optimization you can go even faster, up to 2 vectors per clock, saturating the load ports.
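    A minimal sketch of that simple loop, under some simplifying assumptions (%rdi is 16-byte aligned, %rdx is a nonzero multiple of 16, the search byte is in %esi; real code needs startup and tail handling):

            movd      %esi, %xmm0
            punpcklbw %xmm0, %xmm0     # broadcast the search byte
            punpcklwd %xmm0, %xmm0     # into all 16 lanes of xmm0
            pshufd    $0, %xmm0, %xmm0
    search_loop:
            movdqa    (%rdi), %xmm1    # load 16 bytes
            pcmpeqb   %xmm0, %xmm1     # 0xFF in each matching lane
            pmovmskb  %xmm1, %eax      # one bit per byte lane
            test      %eax, %eax
            jnz       found
            add       $16, %rdi
            sub       $16, %rdx
            jnz       search_loop
            xor       %eax, %eax       # not found: return NULL
            ret
    found:
            bsf       %eax, %eax       # offset of first match in this block
            add       %rdi, %rax       # pointer to the matching byte
            ret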

    (e.g. see glibc's memchr, which IIRC ORs the pcmpeqb results for a whole cache line together to feed one pmovmskb, then goes back and sorts out where the actual hit was.)
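    A sketch of the shape of that cache-line trick (not glibc's actual code; assumes xmm0 already holds the broadcast search byte and that alignment and length checks happen elsewhere):

    cacheline_loop:
            movdqa    (%rdi), %xmm1    # load a whole 64-byte cache line
            movdqa  16(%rdi), %xmm2
            movdqa  32(%rdi), %xmm3
            movdqa  48(%rdi), %xmm4
            pcmpeqb   %xmm0, %xmm1     # compare all 64 bytes...
            pcmpeqb   %xmm0, %xmm2
            pcmpeqb   %xmm0, %xmm3
            pcmpeqb   %xmm0, %xmm4
            por       %xmm2, %xmm1     # ...then OR the results so a
            por       %xmm4, %xmm3     # single pmovmskb/test covers
            por       %xmm3, %xmm1     # the whole line
            pmovmskb  %xmm1, %eax
            add       $64, %rdi
            test      %eax, %eax
            jz        cacheline_loop
            # hit somewhere in [rdi-64, rdi): redo the four compares
            # individually to pinpoint the first matching byte (omitted)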

    repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So it can't mispredict, but it's total garbage for performance with anything but very small buffers.

    SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 up front to avoid reading into an unmapped page.

    Fast strlen: