x86 rep prefix with a count of zero: what happens?

What happens for an initial count of zero for an x86 rep prefix?

Intel's manual says explicitly it’s a while count != 0 loop with the test at the top, which is the sane expected behaviour.

But most of the many vague reports I’ve seen elsewhere suggest that there’s no initial test for zero, so it would behave like a countdown with the test at the end, i.e. repeat { … count -= 1; } until count == 0; which would be a disaster with a count of zero, or who knows what.

Solution

Nothing happens with RCX=0; rep prefixes do check for zero first like the pseudocode says. (Unlike the loop instruction which is exactly like the bottom of a do{}while(--rcx), or a dec rcx/jnz but without affecting FLAGS.)
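A minimal sketch to check this on your own machine, assuming x86-64 and a compiler with GNU C inline asm (GCC or Clang). The helper name fill_rep_stosb is made up for illustration. It returns the final RDI so you can see that a zero count leaves both the pointer and the buffer alone.

```c
#include <stddef.h>

/* Hypothetical helper: fill n bytes at dst with value using rep stosb.
   Returns the final value of RDI (one past the last byte written).
   Relies on DF=0, which the x86-64 calling conventions guarantee. */
static unsigned char *fill_rep_stosb(unsigned char *dst, unsigned char value, size_t n)
{
    __asm__ volatile ("rep stosb"
                      : "+D"(dst), "+c"(n)   /* RDI and RCX are read and written */
                      : "a"(value)           /* AL holds the byte to store */
                      : "memory");
    return dst;
}
```

With n == 0 the returned pointer equals dst and nothing is stored; with n == 3 it returns dst + 3 after writing three bytes.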

I think I've heard of this occasionally being used as an idiom for a conditional load or store, with rep lodsw or rep stosw and a count of 0 or 1, especially in the bad old days before cmov. (cmov is an unconditional load feeding an ALU select operation, so it needs a valid address, unlike rep lods with a count of zero.) This is not efficient, especially for rep stos on modern x86 with Fast Strings microcode (P6 and later), and especially on CPUs without anything like Fast Short Rep-Movs (Ice Lake IIRC). Golden Cove (Alder Lake / Sapphire Rapids) additionally has fast zero-length rep movsb, which makes a zero count the same speed as 1-128 bytes, so rep movsb is not terrible for use-cases that sometimes do a zero-length memcpy.
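As an illustration of that idiom (not a recommendation, given the performance caveats), here is a sketch assuming x86-64 and GNU C inline asm. The names store_if and load_if are hypothetical. The count is 0 or 1, so with a false condition the pointer is never dereferenced and may even be invalid.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store value to *p only if cond is non-zero. */
static void store_if(uint16_t *p, uint16_t value, int cond)
{
    size_t count = (cond != 0);          /* RCX = 0 or 1 */
    __asm__ volatile ("rep stosw"
                      : "+D"(p), "+c"(count)
                      : "a"(value)
                      : "memory");
}

/* Hypothetical helper: return *p if cond is non-zero, else fallback.
   AX starts as fallback and is only overwritten if one iteration runs. */
static uint16_t load_if(const uint16_t *p, uint16_t fallback, int cond)
{
    size_t count = (cond != 0);          /* RCX = 0 or 1 */
    __asm__ volatile ("rep lodsw"
                      : "+S"(p), "+c"(count), "+a"(fallback)
                      :
                      : "memory");
    return fallback;
}
```

For example, load_if(NULL, 5, 0) returns 5 without faulting, which a cmov-based select could not do because cmov always performs the load.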

The same applies for instructions that treat the prefixes as repz / repnz (cmps/scas) instead of unconditional rep (lods/stos/movs). Doing zero iterations means they leave FLAGS unmodified.

If you want to check FLAGS after a repe/ne cmps/scas, you need to make sure the count was non-zero, or that FLAGS was already set such that you'll branch in a useful way for zero-length buffers. (Perhaps from xor-zeroing a register that you're going to want later.)
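A sketch of the second option, assuming x86-64 and GNU C inline asm; bytes_equal is a made-up name. The xor-zeroing of EAX sets ZF=1 before the string instruction, so if RCX is zero and repe cmpsb does nothing, the following sete still reports "equal" for the empty range.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the first n bytes of a and b match, else 0.
   Zero-length buffers compare equal because xor leaves ZF=1 and
   repe cmpsb with RCX=0 does not touch FLAGS. */
static int bytes_equal(const void *a, const void *b, size_t n)
{
    int equal;
    __asm__ volatile ("xor %%eax, %%eax\n\t"   /* EAX = 0 and ZF = 1 */
                      "repe cmpsb\n\t"         /* compare [RSI] with [RDI] while equal */
                      "sete %%al"              /* AL = ZF from last compare (or from xor) */
                      : "=&a"(equal), "+S"(a), "+D"(b), "+c"(n)
                      :
                      : "cc", "memory");
    return equal;
}
```

Without the xor (or some other instruction that sets ZF), the result for n == 0 would depend on whatever happened to be in FLAGS beforehand.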

rep movs and rep stos have fast-strings microcode on CPUs since P6, but the startup overhead makes them rarely worth it, especially when sizes can be short and/or data might be misaligned. They're more useful in kernel code where you can't freely use XMM registers. Some recent CPUs like Ice Lake have fast-short-rep microcode that I think is supposed to reduce startup overhead for small counts.

repe/ne scas/cmps do not have fast-strings microcode on most CPUs, only on very recent CPUs like Sapphire Rapids and maybe Alder Lake P-cores. So they're quite slow, like one load per clock cycle (so 2 cycles per count for cmpsb/w/d/q) according to testing by https://agner.org/optimize/ and https://uops.info/.

Related:

What setup does REP do?
Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - GCC -O1 used to use repne scasb to inline strlen. This is a disaster for long strings.
Which processors support "Fast Short REP CMPSB and SCASB" (very recent feature)
Enhanced REP MOVSB for memcpy - even without ERMSB, rep movs will use no-RFO stores for large sizes, similar to NT stores but not bypassing the cache. Good general Q&A about memory bandwidth considerations.

For conditional load / store, APX will also introduce a way to do that efficiently and branchlessly, with scalar instructions instead of AVX2 or AVX-512 masking: a fault-suppressing (Conditionally-Faulting) cfcmovcc [mem], reg, as well as a load form. See Hard to debug SEGV due to skipped cmov from out-of-bounds memory for more about that and other conditional-load things x86 supports.

With an address-size prefix: Intel/AMD difference

In 64-bit mode with an address-size prefix to make it use ECX/EDI/ESI instead of RCX/RDI/RSI, writing a 32-bit register will zero-extend into the upper 32 bits. (ECX can be zero while RCX is non-zero, and the pointer registers might have high garbage, so ESI != RSI for example. Using 32-bit pointers in long mode, with an ABI that works that way, may be why you're using an address-size prefix in the first place.)

AMD Zen 3 matches Intel's pseudocode: it only writes registers if ECX is non-zero, i.e. when a modification actually happens. With ECX=0, high garbage is preserved.

Intel (including Skylake and a couple others Paolo tested) always writes ECX, for rep lodsb at least, but writes EDI only if it's actually used as a pointer. I haven't tested other instructions to see if their microcode is different.
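If you want to see which behaviour your own CPU has, a probe along these lines should work. It assumes x86-64 and GNU C inline asm, and the function name is made up. The instruction is emitted as raw bytes (67 F3 AC, i.e. an address-size prefix on rep lodsb) to avoid depending on any particular assembler syntax for the prefix. ECX is zero so no memory is accessed.

```c
#include <stdint.h>

/* Hypothetical probe: run addr32 rep lodsb with ECX = 0 but high garbage
   in RCX, and return RCX afterwards.
   Result 0xFFFFFFFF00000000: RCX was not written (reported for AMD Zen 3).
   Result 0: ECX was written and zero-extended (reported for Intel). */
static uint64_t rcx_after_addr32_rep_lodsb(void)
{
    uint64_t rcx = 0xFFFFFFFF00000000ULL;  /* ECX == 0, RCX != 0 */
    uint64_t rsi = 0;                      /* never dereferenced: zero iterations */
    uint32_t eax = 0;
    __asm__ volatile (".byte 0x67, 0xf3, 0xac"   /* addr32 rep lodsb */
                      : "+c"(rcx), "+S"(rsi), "+a"(eax)
                      :
                      : "memory");
    return rcx;
}
```

Emulators and binary-translation tools may give either answer, so treat the result as a statement about whatever is actually executing the instruction.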

This doesn't match the pseudocode in Intel's manual where all register writes are inside if conditions, but it's not rare for the pseudocode to not match corner-case behaviour. (e.g. push rsp where only the text Description section is accurate for that.) In this case the Description section for rep doesn't mention that corner case of count=0 with 32-bit registers.