Tags: assembly, x86, x86-64, micro-optimization

How to MOVe 3 bytes (24bits) from memory to a register?


I can move data items stored in memory to a general-purpose register of my choosing, using the MOV instruction.

MOV r8, [m8]
MOV r16, [m16]
MOV r32, [m32]
MOV r64, [m64]

Now, don’t shoot me, but how is the following achieved: MOV r24, [m24]? (I appreciate the latter is not legal).

In my example, I want to move the characters “Pip”, i.e. 0x706950, to register rax.

section .data           ; Section containing initialized data

    DogsName: db "PippaChips"
    DogsNameLen: equ $-DogsName

I first considered that I could move the bytes separately, i.e. first a byte, then a word, or some combination thereof. However, I cannot reference the ‘top halves’ of eax or rax, so this falls down at the first hurdle, as I would end up overwriting whatever data was moved first.

My solution:

    mov al, byte [DogsName + 2] ; move the character “p” to register al
    shl rax, 16                 ; shift bits left by 16, clearing ax to receive characters “pi”
    mov ax, word [DogsName]     ; move the characters “Pi” to register ax

I could just declare “Pip” as an initialized data item, but the example is just that, an example; I want to understand how to reference 24 bits in assembly, or 40, 48… for that matter.

Is there an instruction more akin to MOV r24, [m24]? Is there a way to select a range of memory addresses, as opposed to providing an offset and specifying a size operator? How can I move 3 bytes from memory to a register in x86-64 assembly?

NASM version 2.11.08, architecture x86.


Solution

  • If you know the 3-byte int isn't at the end of a page, normally you'd do a 4-byte load and mask off the high garbage that came with the bytes you wanted, or simply ignore it if you're doing something with the data that doesn't care about high bits. (See: Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?)


    Unlike stores¹, loading data that you "shouldn't" is never a problem for correctness unless you cross into an unmapped page. (E.g. if db "Pip" came at the end of a page, and the following page was unmapped.) But in this case, you know it's part of a longer string, so the only possible downside is performance if a wide load extends into the next cache line (so the load crosses a cache-line boundary). (See: Is it safe to read past the end of a buffer within the same page on x86 and x64?)

    Either the byte before or the byte after will always be safe to access, for any 3 bytes (without even crossing a cache-line boundary if the 3 bytes themselves weren't split between two cache lines). Figuring this out at run-time is probably not worth it, but if you know the alignment at compile time, you can do either

    mov   eax, [DogsName-1]     ; if previous byte is in the same page/cache line
    shr   eax, 8
    
    mov   eax, [DogsName]       ; if following byte is in the same page/cache line
    and   eax, 0x00FFFFFF
    

    I'm assuming you want to zero-extend the result into eax/rax, like 32-bit operand-size, instead of merging with the existing high byte(s) of EAX/RAX like 8 or 16-bit operand-size register writes. If you do want to merge, mask the old value and OR. Or if you loaded from [DogsName-1] so the bytes you want are in the top 3 positions of EAX, and you want to merge into ECX: shr ecx, 24 / shld ecx, eax, 24 to shift the old top byte down to the bottom, then shift it back while shifting in the 3 new bytes. (There's no memory-source form of shld, unfortunately. Semi-related: efficiently loading from two separate dwords into a qword.) shld is fast on Intel CPUs (especially Sandybridge and later: 1 uop), but not on AMD (http://agner.org/optimize/).
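
    Spelled out as NASM, the merge described above looks something like this (a sketch; it assumes the byte before DogsName is safe to read, as discussed earlier, and that ECX holds the value whose top byte you want to keep):

        mov    eax, [DogsName-1]   ; the 3 wanted bytes land in the top 3 byte positions of EAX
        shr    ecx, 24             ; move ECX's old top byte down to the bottom
        shld   ecx, eax, 24        ; shift it back up, pulling the 3 new bytes in below it
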


    Combining 2 separate loads

    There are many ways to do this, but there's no single fastest way across all CPUs, unfortunately. Partial-register writes behave differently on different CPUs. Your way (byte load / shift / word-load into ax) is fairly good on CPUs other than Core2/Nehalem (which will stall to insert a merging uop when you read eax after assembling it). But start with movzx eax, byte [DogsName + 2] to break the dependency on the old value of rax.

    The classic "safe everywhere" code that you'd expect a compiler to generate would be:

    DEFAULT REL      ; compilers use RIP-relative addressing for static data; you should too.
    movzx   eax, byte [DogsName + 2]   ; avoid false dependency on old EAX
    movzx   ecx, word [DogsName]
    shl     eax, 16
    or      eax, ecx
    

    This takes an extra instruction, but avoids writing any partial registers. However, on CPUs other than Core2 or Nehalem, the best option for 2 loads is writing ax. (Intel P6 CPUs before Core2 can't run x86-64 code, and CPUs without partial-register renaming will merge into rax when writing ax.) Sandybridge does still rename AX, but the merge only costs 1 uop with no stalling, i.e. the same cost as the OR; on Core2/Nehalem the front-end stalls for about 3 cycles while inserting the merge uop.

    Ivybridge and later only rename AH, not AX or AL, so on those CPUs the load into AX is a micro-fused load+merge. Agner Fog doesn't list an extra penalty for mov r16, m on Silvermont or Ryzen (or on any of the other tabs in the spreadsheet I looked at), so presumably other CPUs without partial-reg renaming also execute mov ax, [mem] as a load+merge.

    movzx   eax, byte [DogsName + 2]
    shl     eax, 16
    mov      ax, word [DogsName]
    
    ; when eax is read:
      ; * Sandybridge: extra 1 uop inserted to merge
      ; * core2 / nehalem: ~3 cycle stall (unless you don't use it until after the load retires)
      ; * everything else (including IvB+): no penalty, merge already done
    

    Actually, testing alignment at run-time can be done efficiently. Given a pointer in a register, the previous byte is in the same cache line unless the low 5 or 6 bits of the address are all zero (i.e. the address is aligned to the start of a cache line). Let's assume cache lines are 64 bytes; all current CPUs use that, and I don't think any x86-64 CPUs with 32-byte lines exist. (And we still definitely avoid page-crossing.)

        ; pointer to m24 in RSI
        ; result: EAX = zero_extend(m24)
    
    test   sil, 111111b     ; test the low 6 bits.  There's no TEST r32, imm8, so test sil, imm8 (with a REX prefix) is shorter and never slower than test esi, imm32.
        jz   .aligned_by_64
    
        mov    eax, [rsi-1]
        shr    eax, 8
    .loaded:
    
        ...
        ret    ; end of whatever large function this is part of
    
     ; unlikely block placed out-of-line to keep the common case fast
    .aligned_by_64:
        mov    eax, [rsi]
        and    eax, 0x00FFFFFF
        jmp   .loaded
    

    So in the common case, the extra cost is only one not-taken test-and-branch uop.

    Depending on the CPU, the inputs, and the surrounding code, testing the low 12 bits (to only avoid crossing 4k boundaries) would trade some cache-line splits within pages for better branch prediction, but still never a page split. (In that case test esi, (1<<12)-1. Unlike testing sil with an imm8, testing si with an imm16 is not worth the LCP stall on Intel CPUs to save 1 byte of code. And of course if you can have your pointer in ra/b/c/dx, you don't need a REX prefix, and there's even a compact 2-byte encoding for test al, imm8.)
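
    For concreteness, the only change versus the sequence above is the test itself (a sketch; the label name is just illustrative):

        test   esi, (1<<12)-1    ; nonzero => the byte at [rsi-1] is in the same 4K page
        jz     .page_aligned     ; same structure as the .aligned_by_64 version, just a different condition
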

    You could even do this branchlessly, but clearly not worth it vs. just doing 2 separate loads!

        ; pointer to m24 in RSI
        ; result: EAX = zero_extend(m24)
    
        xor    ecx, ecx
    test   sil, 7         ; might as well keep it within a qword if we're not branching
    setnz  cl             ; ecx = (rsi not 8-byte aligned) ? 1 : 0
    
        sub    rsi, rcx       ; normally rsi-1
        mov    eax, [rsi]
    
    shl    ecx, 3         ; cl = 8 or 0 (shift count)
    shr    eax, cl        ; eax >>= 8, or eax >>= 0
    
                          ; with BMI2:  shrx eax, [rsi], ecx  is more efficient (see the sketch below)
    
        and    eax, 0x00FFFFFF  ; mask off to handle the case where we didn't shift.
    
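    As the comment above suggests, with BMI2 the load can fold into shrx; a sketch of that variant (assuming BMI2 is available):

        xor    ecx, ecx
        test   sil, 7
        setnz  cl                     ; ecx = (rsi not 8-byte aligned) ? 1 : 0
        sub    rsi, rcx               ; rsi-1 in the common case, rsi if 8-byte aligned
        shl    ecx, 3                 ; shift count: 8 or 0
        shrx   eax, dword [rsi], ecx  ; load + variable shift in one instruction (count from ECX, flags untouched)
        and    eax, 0x00FFFFFF        ; clear the garbage top byte (needed for the unshifted case)
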

    True architectural 24-bit load or store

    Architecturally, x86 has no 24-bit loads or stores with an integer register as the destination or source. As Brandon points out, MMX / SSE masked stores (like MASKMOVDQU, not to be confused with pmovmskb eax, xmm0) can store 24 bits from an MMX or XMM reg, given a vector mask with only the low 3 bytes set. But they're almost never useful because they're slow and always have an NT hint (so they write around the cache, and force eviction like movntdq). (The AVX dword/qword masked load/store instructions don't imply NT, but aren't available with byte granularity.)

    AVX512BW (Skylake-server) adds vmovdqu8, which gives you byte-masking for loads and stores with fault-suppression for bytes that are masked off. (I.e. you won't segfault if the 16-byte load includes bytes in an unmapped page, as long as the mask bits aren't set for those bytes. But that case does cause a big slowdown.) So microarchitecturally it's still a 16-byte load, but the effect on architectural state (i.e. everything except performance) is exactly that of a true 3-byte load/store (with the right mask).

    You can use it on XMM, YMM, or ZMM registers.

    ;; probably slower than the integer way, especially if you don't actually want the result in a vector
    mov       eax, 7                  ; low 3 bits set
    kmovw     k1, eax                 ; hoist the mask setup out of a loop
    
    
    ; load:  leave out the {z} to merge into the old xmm0 (or ymm0 / zmm0)
    vmovdqu8  xmm0{k1}{z}, [rsi]    ; {z}ero-masked 16-byte load into xmm0 (with fault-suppression)
    vmovd     eax, xmm0
    
    ; store
    vmovd     xmm0, eax
    vmovdqu8  [rsi]{k1}, xmm0       ; merge-masked 16-byte store (with fault-suppression)
    

    This assembles with NASM 2.13.01. IDK if your NASM is new enough to support AVX512. You can play with AVX512 without hardware using Intel's Software Development Emulator (SDE).

    This looks cool because it's only 2 uops to get a result into eax (once the mask is set up). (However, http://instlatx64.atw.hu/'s spreadsheet of data from IACA for Skylake-X doesn't include vmovdqu8 with a mask, only the unmasked forms. Those do indicate that it's still a single-uop load, or a micro-fused store just like a regular vmovdqu/a.)

    But beware of slowdowns if a 16-byte load would have faulted or crossed a cache-line boundary. I think it internally does the full-width load and then discards the masked-off bytes, with a potentially-expensive special case if a fault needs to be suppressed.

    Also, for the store version, beware that masked stores don't forward as efficiently to loads. (See Intel's optimization manual for more).


    Footnotes:

    1. Wide stores are a problem because even if you put the old value of the extra byte back unchanged, you'd be doing a non-atomic read-modify-write, which could break things if that byte was a lock, for example. Don't store outside of objects unless you know what comes next and that it's safe, e.g. padding that you put there to allow this. You could lock cmpxchg a modified 4-byte value into place, to make sure you're not stepping on another thread's update of the extra byte, but obviously doing 2 separate stores is much better for performance than an atomic cmpxchg retry loop.
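
    For concreteness, a rough sketch of that lock cmpxchg approach to a 3-byte store (register choices are just illustrative; it assumes the containing dword at [rdi] is readable and writable and doesn't extend into an unmapped page):

        ; store the low 3 bytes of EDX to [rdi] without disturbing the byte at [rdi+3]
        and     edx, 0x00FFFFFF     ; keep only the 3 bytes we want to store
        mov     eax, [rdi]          ; expected old value for cmpxchg
    .retry:
        mov     ecx, eax
        and     ecx, 0xFF000000     ; preserve the current 4th byte
        or      ecx, edx            ; combine it with the 3 new bytes
        lock cmpxchg [rdi], ecx     ; if [rdi] still == eax, store ecx; else eax = current [rdi]
        jnz     .retry              ; another thread touched the dword: rebuild and retry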