Search code examples
cpucpu-architectureenergycpu-cycles

Does changing one single bit consume less cycles per byte than adding/subtracting/xoring an entire processor word?


Let's suppose I change one single bit in a word and add two other words.

Does changing one bit in a word consume less CPU cycles than changing an entire word?

If it consumes less CPU cycles, how much faster would it be?


Solution

  • Performance (in clock cycles) is not data-dependent for integer ALU instructions other than division on most CPUs. ADD and XOR have the same 1-cycle latency on the majority of modern pipelined CPUs. (And the same cycle cost as each other on most older / simpler CPUs, whether or not it's 1 cycle.)
    See https://agner.org/optimize/ and https://uops.info/ for numbers on modern x86 CPUs.

    Lower power can indirectly affect performance by allowing higher boost clocks without having to slow down for thermal limits. But the difference in this case is so small that I don't expect it would be a measurable difference on a mainstream CPU, like the efficiency cores of an Alder Lake, or even a mobile phone CPU that's more optimized for low power.


    Power in a typical CPU (using CMOS logic) scales with how many gates have their outputs change value per cycle. When a transistor switches on, it conducts current from Vcc or to ground, charging or discharging the tiny parasitic capacitance of the things the logic gate's output is connected to. Since the majority of the (low) resistance in the path of that current is in the transistor itself, that's where the electrical energy turns into heat.

    For more details, see:


    ADD does require carry propagation potentially across the whole width of the word, e.g. for 0xFFFFFFFF + 1, so ALUs use tricks like carry-lookahead or carry-select to keep the worst case gate-delay latency within one cycle.

    So ADD involves more gates than a simple bitwise operation like XOR, but still not many compared to the amount of gates involved in controlling all the decode and other control logic to get the operands to the ALU and the result written back (and potentially bypass-forwarded to later instructions that use the result right away.)

    Also, a typical ALU probably doesn't have fully separate adder vs. bitwise units, so a lot of those adder gates are probably seeing their inputs change, but control signals block carry propagation. (i.e. a typical ALU implements XOR using a lot of the same gates as ADD, but with control signals controlling AND gates or something to all or block carry propagation. XOR is add-without-carry.) An integer ALU in a CPU will usually be at least an adder-subtractor so one of the inputs is coming through multiple gates, with other control signals that can make it do bitwise ops.

    But there's still maybe a few fewer bit-flips when doing an XOR operation than an ADD. Partly it would depend what the previous outputs were (of whatever computation it did in the previous cycle, not the value of one of the inputs to the XOR). But with carry propagation blocked by AND gates, flipping the inputs to those gates doesn't change the outputs, so less capacitance is charged or discharged.

    In a high-performance CPU, a lot of power is spent on pipelining and out-of-order exec, tracking instructions in flight, and writing back the results. So even the whole ALU ADD operation is a pretty minor component of total energy cost to execute the instruction. Small differences in that power due to operands are an even smaller difference. Pretty much negligible compared to how many gates flip every clock cycle just to get data and control signals sent to the right place.


    Another tiny effect: if your CPU didn't do register renaming, then possibly a few fewer transistors might flip (in the register file's SRAM) when writing back the result if it's almost the same as what that register held before.

    (Assuming an ISA like x86 where you do xor dst, src for dst ^= src, not a 3-operand ISA where xor dst, src1, src2 could be overwriting a different value if you didn't happen to pick the same register for dst and src1.)

    If your CPU does out-of-order exec with register renaming, writes to the register file won't be overwriting the same SRAM cells as the original destination value, so it depends what other values were computed recently in registers.


    If you want to see a measurable difference in power, run instructions like integer multiply, or FP mul or FMA. Or SIMD instructions, so the CPU is doing 4x or 8x 32-bit addition or shuffle in parallel. Or 8x 32-bit FMA. The max-power workload on a typical modern x86 CPU is two 256-bit FMAs per clock cycle.

    See also: