Tags: assembly, x86-64, cpu-architecture, micro-optimization, branch-prediction

Is CMOVcc considered a branching instruction?


I have this memchr code that I'm trying to make non-branching:

.globl memchr
memchr:                         # memchr(ptr = %rdi, byte = %sil, count = %rdx)
        mov %rdx, %rcx          # rcx = byte count for repne scasb
        mov %sil, %al           # al = the byte to scan for
        cld                     # scan forward
        repne scasb             # advance rdi until a match or rcx == 0
        lea -1(%rdi), %rax      # rax = address of the last byte examined
        test %rcx, %rcx         # count exhausted?
        cmove %rcx, %rax        # if so, return NULL (rcx is 0 here)
        ret

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?


Solution

  • No, it's not a branch; that's the whole point of cmovcc.

    It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it unconditionally loads the memory source, unlike ARM predicated load instructions that are truly NOPed. So you can't use it with maybe-bad pointers for branchless bounds or NULL checks. That's maybe the clearest illustration that it's definitely not a branch.)
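    For example, here's a tiny hypothetical fragment (not from the question's code) showing why cmov's memory form can't make a load conditional:

            test   %rdi, %rdi          # ZF=1 if the pointer is NULL
            cmovne (%rdi), %eax        # the load from (%rdi) happens
                                       # regardless of the condition,
                                       # so this still faults on NULL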

    But anyway, it's not predicted or speculated in any way; as far as the CPU scheduler is concerned it's just like an adc instruction: 2 integer inputs + FLAGS, and 1 integer output. (The only difference from adc/sbb is that it doesn't write FLAGS, and of course it runs on an execution unit with different internals.)

    Whether that's good or bad depends entirely on the use case. See also the Q&A "gcc optimization flag -O3 makes code slower than -O2" for much more about cmov's upsides and downsides.
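    As a concrete illustration of cmov as a data-dependent select, here is a minimal branchless max (a standalone hypothetical example, unrelated to the memchr code above):

    # int max(int a, int b)   -- a in %edi, b in %esi (SysV ABI)
    max:
            mov    %edi, %eax          # result = a
            cmp    %esi, %edi          # compare a against b
            cmovl  %esi, %eax          # if a < b, result = b
            ret

    The cmovl always executes; it just selects which value lands in %eax, so there is nothing for the branch predictor to guess at and nothing to mispredict.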


    Note that repne scasb is not fast. "Fast Strings" only works for rep stos / movs.

    repne scasb runs at about 1 byte per clock cycle on modern CPUs, i.e. typically about 16x worse than a simple SSE2 pcmpeqb/pmovmskb/test+jnz loop (sketched below). And with clever optimization you can go even faster, up to 2 vectors per clock, saturating the load ports.
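    A minimal sketch of that simple loop, under some simplifying assumptions (%rdi is 16-byte aligned, %rdx is a nonzero multiple of 16, the search byte is in %esi; real code needs startup and tail handling):

            movd      %esi, %xmm0
            punpcklbw %xmm0, %xmm0     # broadcast the search byte
            punpcklwd %xmm0, %xmm0     # into all 16 lanes of xmm0
            pshufd    $0, %xmm0, %xmm0
    search_loop:
            movdqa    (%rdi), %xmm1    # load 16 bytes
            pcmpeqb   %xmm0, %xmm1     # 0xFF in each matching lane
            pmovmskb  %xmm1, %eax      # one bit per byte lane
            test      %eax, %eax
            jnz       found
            add       $16, %rdi
            sub       $16, %rdx
            jnz       search_loop
            xor       %eax, %eax       # not found: return NULL
            ret
    found:
            bsf       %eax, %eax       # offset of first match in this block
            add       %rdi, %rax       # pointer to the matching byte
            ret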

    (e.g. see glibc's memchr, which IIRC ORs the pcmpeqb results for a whole cache line together to feed one pmovmskb, then goes back and sorts out where the actual hit was.)
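    A sketch of the shape of that cache-line trick (not glibc's actual code; assumes xmm0 already holds the broadcast search byte and that alignment and length checks happen elsewhere):

    cacheline_loop:
            movdqa    (%rdi), %xmm1    # load a whole 64-byte cache line
            movdqa  16(%rdi), %xmm2
            movdqa  32(%rdi), %xmm3
            movdqa  48(%rdi), %xmm4
            pcmpeqb   %xmm0, %xmm1     # compare all 64 bytes...
            pcmpeqb   %xmm0, %xmm2
            pcmpeqb   %xmm0, %xmm3
            pcmpeqb   %xmm0, %xmm4
            por       %xmm2, %xmm1     # ...then OR the results so a
            por       %xmm4, %xmm3     # single pmovmskb/test covers
            por       %xmm3, %xmm1     # the whole line
            pmovmskb  %xmm1, %eax
            add       $64, %rdi
            test      %eax, %eax
            jz        cacheline_loop
            # hit somewhere in [rdi-64, rdi): redo the four compares
            # individually to pinpoint the first matching byte (omitted)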

    repne scasb also has startup overhead, but microcode branching is different from regular branching: it's not branch-predicted on Intel CPUs. So it can't mispredict, but it's total garbage for performance with anything but very small buffers.

    SSE2 is baseline for x86-64, and efficient unaligned loads + pmovmskb make it a no-brainer for memchr, where you can check for length >= 16 up front to avoid reading into an unmapped page.

    Fast strlen: