I need to move 1 byte from an xmm register to memory without using general-purpose registers, and I also can't use SSE4.1. Is it possible?
=(
Normally you'd want to avoid this in the first place. For example, instead of doing separate byte stores, can you do one wider load and merge (pand/pandn/por if you don't have pblendvb), then store back the merge result?
That's not thread-safe (non-atomic RMW of the unmodified bytes), but as long as you know the bytes you're RMWing don't extend past the end of the array or struct, and no other threads are doing the same thing to other elements in the same array/struct, it's the normal way to do stuff like upper-case every lower-case letter in a string while leaving other bytes unmodified.
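For instance, a minimal SSE2 sketch of that load/blend/store idea, assuming the byte to store is byte 0 of xmm0, RDI holds a destination where rewriting the surrounding 15 bytes is safe, and the mask-constant label is made up:

section .rodata
align 16
byte0_mask: db 0xFF, 0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0

section .text
movdqu  xmm1, [rdi]            ; load the existing 16 bytes
movdqa  xmm2, [rel byte0_mask] ; 0xFF in the byte to replace, 0 elsewhere
pand    xmm0, xmm2             ; keep only the new byte from xmm0
pandn   xmm2, xmm1             ; keep the other 15 original bytes
por     xmm0, xmm2             ; merge
movdqu  [rdi], xmm0            ; store back: non-atomic RMW of 16 bytes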
Single-uop stores are only possible from vector registers in 4, 8, 16, 32, or 64-byte sizes, except with AVX-512BW masked stores with only 1 element unmasked. Narrower stores like pextrw or pextrb involve a shuffle uop to extract the 2 or 1 bytes to be stored.
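To illustrate the AVX-512BW exception (not an option on an SSE-only machine, and vmovdqu8 with an XMM register also needs AVX-512VL; the mask-constant label is made up), loading the mask with kmovw from memory even avoids GP registers:

section .rodata
onemask: dw 1                  ; only bit 0 set: unmask byte 0

section .text
kmovw    k1, [rel onemask]     ; AVX-512F: load a mask register from memory
vmovdqu8 [rdi]{k1}, xmm0       ; AVX-512BW+VL: store only byte 0 of xmm0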
The only good way to truly store exactly 1 byte without GP integer regs is with SSE4.1 pextrb [mem], xmm0, 0..15. That's still a shuffle + store on current CPUs, even with an immediate 0.
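For reference, the SSE4.1 form (ruled out by the question) is a one-liner:

pextrb [rdi], xmm0, 0          ; SSE4.1: shuffle uop + store uop, writes 1 byte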
If you can safely write 2 bytes at the destination location, pextrw with a memory destination is usable; note that the memory-destination form of pextrw is also SSE4.1, though (the original SSE2 form of pextrw can only extract to a GP integer register).
You could use an SSE2 maskmovdqu byte-masked store (with a 0xff,0,0,... mask), but you don't want to because it's much slower than movd eax, xmm0 / mov [mem], al. e.g. on Skylake: 10 uops, one per 6 cycles throughput.
And it's worse than that if you want to reload the byte afterwards, because (unlike AVX / AVX-512 masked stores) maskmovdqu has NT semantics like movntps (bypassing cache, or evicting the cache line if it was previously hot).
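If you did want to try it anyway, here's a minimal sketch; maskmovdqu implicitly stores to [rdi], and the mask-constant label is made up. Selection is by the high bit of each mask byte, so 0xFF selects byte 0 only:

section .rodata
align 16
lowbyte_mask: db 0xFF, 0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0

section .text
movdqa     xmm1, [rel lowbyte_mask] ; high bit set only in mask byte 0
maskmovdqu xmm0, xmm1               ; store only byte 0 of xmm0 to [rdi], NT semantics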
If your requirement is fully artificial and you just want to play silly computer tricks (avoiding ever having your data in registers), you could also set up scratch space, e.g. on the stack, and use movsb to copy it:
;; with destination address already in RDI
lea rsi, [rsp-4] ; scratch space in the red zone below RSP on non-Windows
movd [rsi], xmm0
movsb ; copy a byte, [rdi] <- [rsi], incrementing RSI and RDI
This is obviously slower than the normal way, and it needed an extra register (RSI) for the tmp buffer address. And you need the exact destination address in RDI, not [rel foo] static storage or any other flexible addressing mode.
pop can also copy mem-to-mem, but it's only available with 16-bit and 64-bit operand-size, so it can't save you from needing RSI and RDI.
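For completeness, a sketch of that push/pop variant, again with red-zone scratch and RDI holding the destination; note it writes 2 bytes at the destination, not 1:

lea  rsi, [rsp-4]      ; scratch in the red zone (non-Windows)
movd [rsi], xmm0       ; 4-byte store of xmm0's low dword
push word [rsi]        ; copy 2 bytes through the stack...
pop  word [rdi]        ; ...to the destination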
Since the above way needs an extra register, it's worse in pretty much every way than the normal way:
movd esi, xmm0 ; pick any register.
mov [rdi], sil ; al..dl would avoid needing a REX prefix for low-8
;; or even use a register where you can read the low and high bytes separately
movd eax, xmm0
mov [rdi], al ; no REX prefix needed, more compact than SIL
mov [rsi], ah ; scatter two bytes reasonably efficiently
shr eax, 16 ; bring down the next 2 bytes
(Reading AH has an extra cycle of latency on current Intel CPUs, but it's fine for throughput, and we're storing here anyway so latency isn't much of a factor.)
xmm -> GP integer transfers are not slow on most CPUs. (Bulldozer-family is the outlier, but it's still comparable latency to store/reload; Agner Fog said in his microarch guide (https://agner.org/optimize/) he found AMD's optimization-manual suggestion to store/reload was not faster.)
It's hard to imagine a case where movsb could be better, since you already need a free register for that way, and movsb is multiple uops. Possibly if you're bottlenecked on port-0 uops for movd r32, xmm on current Intel CPUs? (https://uops.info/)