Tags: c++, assembly, x86-64, cpu-architecture, conditional-move

The Execution Order in an Assembly Algorithm That Utilizes a Conditional Move Instruction


I have recently been working on an assembly algorithm for calculating the absolute difference between two long integers, and I have some concerns regarding the execution order in the presence of conditional move instructions. Below is a snippet of the assembly code in question:

.globl _start

_start:
    movq $3, %rsi        # x
    movq $5, %rdi        # y

    movq %rdi, %rax      # rax = y (save a copy)
    subq %rsi, %rdi      # rdi = y - x
    subq %rax, %rsi      # rsi = x - y, sets FLAGS
    cmovgq %rsi, %rdi    # if x > y: rdi = x - y, else keep y - x

    movq $60, %rax       # SYS_exit; exit status = rdi = |x - y|
    syscall

In the provided assembly code, the cmovgq instruction is used to conditionally move the result of a subtraction (x - y or y - x) based on the outcome of a previous comparison (x > y). My concern is whether the processor might execute the cmovgq instruction without waiting for the result of the subtraction (subq %rax, %rsi).

I am aware that there is no explicit data dependency between the subtraction instruction and the conditional move instruction. However, given the pipelined execution of modern processors, I am uncertain whether any issues could arise here.

To provide context, the corresponding C code for this assembly algorithm is as follows:

long absdiff(long x, long y) {
    long result = 0;
    // unpredictable branch
    if(x < y) {
        result = y - x;
    } else {
        result = x - y;
    }
    return result;
}

I would greatly appreciate any insights, explanations, or suggestions from the community regarding the execution order of these instructions and whether pipelining could introduce any issues in this algorithm.


Solution

  • cmov has 3 inputs: two registers (explicit operands), and FLAGS (implicit).
    Same as adc and sbb, except those also write FLAGS as an output.

    The data dependency between sub writing FLAGS and cmovg reading it is every bit as real as the dependency between the sub instructions writing rsi and rdi and cmovg reading those registers.
    In formal terms, this is a RAW (read-after-write) data hazard, which CPUs must respect. See Wikipedia's articles on the classic RISC in-order pipeline, and on data hazards in general.

    (All current x86-64 CPUs do out-of-order exec, where a scheduler keeps track of which values are ready, and thus which uops are ready to run if there's a free execution unit. Register renaming avoids WAR and WAW hazards. See Modern Microprocessors: A 90-Minute Guide! and deep-dive breakdowns of real CPUs like Sandy Bridge, the first of what's still Intel's current family of big-core CPUs.)

    cmov isn't special; it's just an ALU select operation with 3 inputs and one output. This is unlike 32-bit ARM predicated instructions, where a false predicate turns the instruction into a NOP, so even a load from a bad address won't fault. x86 cmov with a memory source always performs the load, so a bad address faults regardless of the condition; APX cfcmov (conditionally-faulting cmov, a conditional load or store) does suppress the fault like ARM predication.


    FLAGS is renamed just like other registers, so CPUs respect data dependencies through FLAGS exactly the same as through integer registers. A single register-file entry can hold an integer register and FLAGS, so the CPU can treat instructions like sub that write FLAGS and a register as having a single output.

    A later instruction like lea that writes a register but not FLAGS can leave just FLAGS pointing at the older physical register in the RAT (register allocation table). Conversely, an instruction like cmc or add [mem], eax that writes FLAGS but not a register would have to allocate a PRF entry for just FLAGS, leaving the old entry pointed-to by just the integer register.


    The cardinal rule of pipelined out-of-order exec is that a single thread runs as if the CPU ran one instruction at a time in program order. An instruction can always see the results of previous instructions, and never see the result of future instructions. If this wasn't true, the CPU wouldn't correctly implement the ISA specification and couldn't run existing code. i.e. the design would be buggy.

    (I'm just talking about correctness here, not performance differences like cache conflict-misses caused by early exec of later loads that evicted data needed by a load whose address wasn't ready yet. Microarchitectural state can depend on timing details; Spectre and other CPU vulnerabilities use timing to turn microarchitectural state into architectural state, i.e. long-term values in registers or memory.)


    A few ISAs have had execution models other than just 1 at a time in program order. For example, VLIW machines expect the compiler (or asm programmer) to find, say, 3 independent operations to put into one wide instruction-word, so the CPU doesn't have to check for dependencies or hazards between them. And The Mill had loads that produce their value a fixed number of instructions later, e.g. 5 (depending on the microarchitecture, so you had to recompile if that changed). So you could keep reading the old value until then, while also hiding some load latency.

    For x86 (and current mainstream ISAs like RISC-V and AArch64), the execution model is indeed that instructions execute one at a time, in program order. It's basically like the C++ as-if rule: internally it can do anything that results in the same register and memory values for a single thread. Memory-ordering rules define what orderings of operations other threads looking at shared memory are or aren't allowed to see. (Memory ordering is not the same thing as execution order, e.g. store buffers create memory reordering on their own, at least StoreLoad, even on an in-order CPU. A store buffer is important for speculative execution of store instructions, among other things.)