assembly mips cpu-architecture micro-optimization mips32

About negate a sign-integer in mips?

I'm thinking about how to negate a signed-integer in mips32. My intuition is using definition of 2's complement like: (suppose $s0 is the number to be negated)

nor $t0, $s0, $s0   ; 1's complement
addiu $t0, $t0, 1   ; 2's = 1's + 1

then I realized that it can be done like:

sub $t0, $zero, $s0

so... what's the difference? Which is faster? IIRC sub will try to detect overflow, but would this make is slower? Finally, is there any other way to do so?

Solution

subu $t0, $zero, $s0 is the best way, and is what compilers do.

On any given implementation of MIPS, most simple ALU instructions (add/sub/and/nor) have identical performance. Getting the same work done in 1 simple instruction instead of 2 simple instructions is a win for code-size, latency, and throughput.

Fewer instructions isn't always better, but MIPS being a classic RISC ISA doesn't have many "slow" instructions other than mult / div / rem.

If you wanted -x-1, then you'd optimize that to a single nor $t0, $zero, $s0 using a 2's complement identity.

sub instead of subu would raise an exception on -INT_MIN, which you avoid using addiu in the nor/add version. You should always use the u version of sub and add instructions unless you specifically want signed overflow to raise an exception. C compilers always use the u version. (In C, signed overflow is undefined behaviour¹.)

int neg(int x) { return -x; }

On the Godbolt compiler explorer, MIPS gcc11.2 -O3 -fno-delayed-branch compiles it exactly like we'd expect:

neg(int):
    subu    $2,$0,$4
    jr      $31
    nop              # filling the branch delay slot for

Asking a compiler is usually a good way to find efficient ways to do things in asm. (GCC always makes asm that's compatible with real MIPS CPUs, and GAS is different from MARS/SPIM classic MIPS assembler. See also Tweak mips-gcc output to work with MARS)

IIRC sub will try to detect overflow, but would this make is slower?

No. In the no-exception case, sub has the same performance as subu, as far as I know on all MIPS CPUs.

CPUs are heavily optimized for the common case. Taking an exception happens so rarely in normal code that it's fine for exceptions to take quite a lot of cycles. So the CPU core just has to detect an exception before any bad result is written back to the register file, or stored to cache/memory. There are at least a couple pipeline stages between Execute and Write-Back on any MIPS pipeline.

In the case of signed overflow, the ALU can produce the overflow signal in the same cycle as the result. ISAs with a "flags" register that's updated by most instructions do this all the time as part of the normal operation of an add instruction: if software wants to do something special on signed overflow on x86 or ARM, they'd use a conditional branch on the overflow flag (OF on x86, V on ARM). MIPS is special in that it's difficult to do anything other than take an exception on signed overflow.

Footnote 1: Undefined Behaviour means it's allowed to fault, but not required to fault, and often people would rather it didn't. Compilers want to be able to optimize and introduce transformations that create temporary values that never exist in the C abstract machine, so they must avoid faulting when doing that. Always using subu is a good way to do that, so it doesn't need to track whether the operation and the inputs values are ones that would have happened in the C abstract machine. But in this case it would be legal to use sub.

Another implication of UB is that the compiler's allowed to assume that the -x result didn't overflow to INT_MIN and thus x before and after can't have been INT_MIN.

So if you were doing this as part of finding the absolute value, you'd want to avoid that with 0U - x to convert x to unsigned before doing unsigned subtraction, producing an unsigned result. On a 2's complement machine like MIPS, casting signed int to same width unsigned is free, just use the bit-pattern unchanged. return x<0 ? 0U - x : x;

For that, a 2's complement bithack is useful, which GCC uses with -march=mips32r3 and later. (IDK why it thinks branching is better on CPUs like -march=r14000, a 4-wide out-of-order exec CPU.)

uabs(int):
        sra     $3,$4,31          # broadcast the sign bit
        xor     $2,$4,$3          # ~x      or x
        subu    $2,$2,$3          # ~x-(-1) or x
        jr      $31
        nop