Search code examples
assemblyoptimizationbit-manipulationsigned68000

Fastest way to copy sign value among 68k data registers. (Branchless conditional negation of an integer)


I'm looking for an optimized way to change the sign of a value contained in a Data Register using the sign from a value in another Data Register.

Source register will contain either 1 or -1 and destination be always positive therefore all is needed is multiplication. Here's some pseudo code of what I'd do:

MOVE.B #-1,D0
MOVE.W #173,D1
MULS.W D0,D1

After that simple math D1 would carry the sign from D0 and become either -173 or 173. Which is the desired result but MULS takes up to 70 cycles and I'm hoping to save some by finding a trick to somehow only "copy" the sign .

Last word: trick should be branchless as branching is what I'm trying to prevent by copying the sign in the first place.

Thanks in advance for any information or advice.


Solution

  • You could branchlessly select between x and -x, using a bit-hack.

    But you can do better: 2's complement negation can be expressed as -x == ~x + 1, and both the NOT and increment can be expressed in terms of XOR and SUB with -1. But the same operations with a 0 are no-ops, leaving the value unchanged.

    (This trick is often used for 2's complement absolute value, where the 0 or -1 is obtained from arithmetic right shift, x >> 31, to copy the sign bit to all bits of a register. i.e. applying your conditional-sign-flip operation to the same number that gave us the sign bit. https://graphics.stanford.edu/~seander/bithacks.html#IntegerAbs)

    Of course, you need a WORD sized -1, not just BYTE, since you need to cover all bits of value you're negating (or not); as Erik points out, this means move.w #-1, d0 or ext.w if you can't promote your byte value to a word at the origin.

    ;; d0 = word  1 or -1
       asr.w     #1, d0
    ;; d0 = word  0 or -1
    
       eor.w     d0, d1           ;  ~d1 
       sub.w     d0, d1           ;  ~d1 - 1  (or unchanged for d0=0)
    
    ; d1 = d1 or -d1  according to d0
    

    If you want to negate based on the sign of some other number, generate a 0 / -1 directly instead of ever creating a 1 / -1.
    (i.e. an integer copysign, a bit like multiplying by signum(y) if ISO C had a signum function, but without zeroing on y=0.)

    Normally you'd use x >> 15 with an arithmetic right shift, but M68K immediate shifts only allow a count from 1..8. (And are slow for large counts on CPUs without a barrel shifter). So you'd actually want get 16 copies of the sign bit by sign-extending to 32-bit and then swapping halves to put it in place:

    ;; manufacture a 0 / -1 word according to d0.w < 0
       ext.l   d0       ; high 16 bits = 0 / -1
       swap    d0       ; those are now the low bits
    
    ; optionally: ext.l  d0   ; to make a 32-bit 0 / -1
    

    For a 32-bit source, you might tst.l d0,d0 / smi d0 (Set if MInus) to produce a byte-sized 0 / -1, then sign-extend that. (Although only 68020 can do that in one extb.l instruction; 68000 would need ext.w + ext.l)


    Or if you have a MC68020 or later CPU, you can use a sign-extending bit-field extract of just the sign bit. bfexts d0 {15:1}, d0. It's a 4-byte instruction, so same code-size as ext.l + swap, but it extends to 32-bit and could work on a 32-bit source register without additional instructions. And even for 16-bit integers, it's a single instruction instead of 2, so it may be good, especially if it's executing from cache? Or not since it seems to be slower.

    Other alternatives:

    • moveq #15 into a register for an asr by 15 or 31. Slower even on m68k than ext/swap.
    • rotate left by 1, and-immediate with 1 to isolate the sign bit as a 0 / 1, and neg. Works easily for 32-bit inputs, maybe good there especially on 68000 where a shift by 31 might be slow. But rol is slowish on 68020.

    https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf includes instruction timings for a 68020, as "best case" (overlap with execution of previous instruction, and hot in cache), "cached case" (just cache, no overlap), and "worst case" (neither)

    • bfexts Dn : 5 cycles (best) / 8 (cached) / 8 (worst)
    • EXT Dn : 1 (best) / 4 (cached) / 4 (worst)
    • SWAP Rx,Ry : 1 (best) / 4 (cached) / 4 (worst)

    So maybe bfexts isn't a good choice for 16-bit inputs, when ext/swap would do the job.

    • ASR Dn: 3 (best) / 6 (cached) / 6 (worst). Interestingly, immediate-count logical shifts are faster than this, but arithmetic shifts are the same. (68020 apparently has a barrel shifter; performance doesn't depend on the count. 68000 may not, IDK.)
    • MOVEQ #<data>, Dn: 0 (best) / 2 (cached) / 3 (worst)
    • ROL Dn: 5 (best) / 8 (cached) / 8 (worst)
    • ANDI #<data>,Dn: 0 (best) / 2 (cached) / 3 (worst)
    • EOR Dn,Dn: 0 (best) / 2 (cached) / 3 (worst)
    • SUB EA,Dn: 0 (best) / 2 (cached) / 3 (worst). Fetch EA time for a Dn source is 0 cycles.

    Overlap (creating the best case) depends on how slow the previous instruction was, I think, especially memory access rather than core cycles; the manual I linked has some details, but they also make a big point that you can't just add up best or worst cases except as lower/upper bounds; you need to benchmark to find typical run times. IDK how data-dependencies affect this.

    These are 68020 timings, since we had a choice there of bfexts vs. ext/swap. On 68000, ext/swap is almost certainly better than any alternative.