assembly optimization bit-manipulation signed 68000

Fastest way to copy sign value among 68k data registers. (Branchless conditional negation of an integer)

I'm looking for an optimized way to change the sign of a value contained in a Data Register using the sign from a value in another Data Register.

Source register will contain either 1 or -1 and destination be always positive therefore all is needed is multiplication. Here's some pseudo code of what I'd do:

MOVE.B #-1,D0
MOVE.W #173,D1
MULS.W D0,D1

After that simple math D1 would carry the sign from D0 and become either -173 or 173. Which is the desired result but MULS takes up to 70 cycles and I'm hoping to save some by finding a trick to somehow only "copy" the sign .

Last word: trick should be branchless as branching is what I'm trying to prevent by copying the sign in the first place.

Thanks in advance for any information or advice.

Solution

You could branchlessly select between x and -x, using a bit-hack.

But you can do better: 2's complement negation can be expressed as -x == ~x + 1, and both the NOT and increment can be expressed in terms of XOR and SUB with -1. But the same operations with a 0 are no-ops, leaving the value unchanged.

(This trick is often used for 2's complement absolute value, where the 0 or -1 is obtained from arithmetic right shift, x >> 31, to copy the sign bit to all bits of a register. i.e. applying your conditional-sign-flip operation to the same number that gave us the sign bit. https://graphics.stanford.edu/~seander/bithacks.html#IntegerAbs)

Of course, you need a WORD sized -1, not just BYTE, since you need to cover all bits of value you're negating (or not); as Erik points out, this means move.w #-1, d0 or ext.w if you can't promote your byte value to a word at the origin.

;; d0 = word  1 or -1
   asr.w     #1, d0
;; d0 = word  0 or -1

   eor.w     d0, d1           ;  ~d1 
   sub.w     d0, d1           ;  ~d1 - 1  (or unchanged for d0=0)

; d1 = d1 or -d1  according to d0

If you want to negate based on the sign of some other number, generate a 0 / -1 directly instead of ever creating a 1 / -1.
(i.e. an integer copysign, a bit like multiplying by signum(y) if ISO C had a signum function, but without zeroing on y=0.)

Normally you'd use x >> 15 with an arithmetic right shift, but M68K immediate shifts only allow a count from 1..8. (And are slow for large counts on CPUs without a barrel shifter). So you'd actually want get 16 copies of the sign bit by sign-extending to 32-bit and then swapping halves to put it in place:

;; manufacture a 0 / -1 word according to d0.w < 0
   ext.l   d0       ; high 16 bits = 0 / -1
   swap    d0       ; those are now the low bits

; optionally: ext.l  d0   ; to make a 32-bit 0 / -1

For a 32-bit source, you might tst.l d0,d0 / smi d0 (Set if MInus) to produce a byte-sized 0 / -1, then sign-extend that. (Although only 68020 can do that in one extb.l instruction; 68000 would need ext.w + ext.l)

Or if you have a MC68020 or later CPU, you can use a sign-extending bit-field extract of just the sign bit. bfexts d0 {15:1}, d0. It's a 4-byte instruction, so same code-size as ext.l + swap, but it extends to 32-bit and could work on a 32-bit source register without additional instructions. And even for 16-bit integers, it's a single instruction instead of 2, so it may be good, especially if it's executing from cache? Or not since it seems to be slower.

Other alternatives:

moveq #15 into a register for an asr by 15 or 31. Slower even on m68k than ext/swap.
rotate left by 1, and-immediate with 1 to isolate the sign bit as a 0 / 1, and neg. Works easily for 32-bit inputs, maybe good there especially on 68000 where a shift by 31 might be slow. But rol is slowish on 68020.

https://www.nxp.com/docs/en/data-sheet/MC68020UM.pdf includes instruction timings for a 68020, as "best case" (overlap with execution of previous instruction, and hot in cache), "cached case" (just cache, no overlap), and "worst case" (neither)

bfexts Dn : 5 cycles (best) / 8 (cached) / 8 (worst)
EXT Dn : 1 (best) / 4 (cached) / 4 (worst)
SWAP Rx,Ry : 1 (best) / 4 (cached) / 4 (worst)

So maybe bfexts isn't a good choice for 16-bit inputs, when ext/swap would do the job.

ASR Dn: 3 (best) / 6 (cached) / 6 (worst). Interestingly, immediate-count logical shifts are faster than this, but arithmetic shifts are the same. (68020 apparently has a barrel shifter; performance doesn't depend on the count. 68000 may not, IDK.)
MOVEQ #<data>, Dn: 0 (best) / 2 (cached) / 3 (worst)
ROL Dn: 5 (best) / 8 (cached) / 8 (worst)
ANDI #<data>,Dn: 0 (best) / 2 (cached) / 3 (worst)
EOR Dn,Dn: 0 (best) / 2 (cached) / 3 (worst)
SUB EA,Dn: 0 (best) / 2 (cached) / 3 (worst). Fetch EA time for a Dn source is 0 cycles.

Overlap (creating the best case) depends on how slow the previous instruction was, I think, especially memory access rather than core cycles; the manual I linked has some details, but they also make a big point that you can't just add up best or worst cases except as lower/upper bounds; you need to benchmark to find typical run times. IDK how data-dependencies affect this.

These are 68020 timings, since we had a choice there of bfexts vs. ext/swap. On 68000, ext/swap is almost certainly better than any alternative.