assembly, x86, sse, sse2, sse4

how to copy bytes into xmm0 register


I have the following code, which works fine but seems inefficient given that the end result only requires the data in xmm0.

         mov rcx, 16                       ; get first word, up to 16 bytes
         mov rdi, CMD                      ; ...and put it in CMD
         mov rsi, CMD_BLOCK
 @@:     lodsb
         cmp al, 0x20
         je @f
         stosb
         loop @b

 @@:     mov rsi, CMD                      ;
         movdqa xmm0, [rsi]                ; mov cmd into xmm0

I'm sure that using SSE2, SSE4, etc., there's a better way that doesn't require the CMD buffer, but I'm struggling to work out how to do it.


Solution

  • Your code looks like it gets bytes from CMD_BLOCK up to the first 0x20, and I assume it wants zeros above that.

    That's not even close to the most efficient way to write a byte-at-a-time copy loop. Never use the LOOP instruction, unless you're specifically tuning for one of the few architectures where it's not slow (e.g. AMD Bulldozer). See Agner Fog's stuff, and other links from the tag wiki. Or use SSE/AVX via C intrinsics, and let a compiler generate the actual asm.

    But more importantly, you don't even need a loop if you use SSE instructions.

    I'm assuming you zeroed the 16B CMD buffer before starting the copy, otherwise you might as well just do an unaligned load and grab whatever garbage bytes are there beyond the data you want.

    Things are much easier if you can safely read past the end of CMD_BLOCK without causing a segfault. Hopefully you can arrange for that to be safe. e.g. make sure it's not at the very end of a page that's followed by an unmapped page. If not, you might need to do an aligned load, and then conditionally another aligned load if you didn't get the end of the data.
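
    If you control the buffer's definition, the easy way to arrange that is to align it and pad it out to a full 16 bytes, so a 16-byte load from its start stays inside memory you own (and an aligned 16B object can't straddle a 4 KiB page anyway). Below is a minimal C sketch of that layout idea, with a hypothetical buffer name; the asm equivalent is an align directive plus padding.

        /* Sketch only: a hypothetical buffer, not the OP's actual CMD_BLOCK. */
        #include <stdalign.h>

        /* Padded to 16 bytes and 16-byte aligned: a 16-byte load from the start
           can't read past memory we own, and it can't cross a page boundary. */
        alignas(16) static unsigned char cmd_block[16];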


    SSE2 pcmpeqb, find the first match, and zero bytes at that position and higher

    section .rodata
    
    ALIGN 32              ; No cache-line splits when taking an unaligned 16B window on these 32 bytes
    dd -1, -1, -1, -1
    zeroing_mask:
    dd  0,  0,  0,  0
    
    ALIGN 16
    end_pattern:  times 16   db 0x20    ; pre-broadcast the byte to compare against  (or generate it on the fly)
    
    section .text
    
        ... as part of some function ...
        movdqu   xmm0, [CMD_BLOCK]       ; you don't have to waste instructions putting pointers in registers.
        movdqa   xmm1, [end_pattern]     ; or hoist this load out of a loop
        pcmpeqb  xmm1, xmm0
    
        pmovmskb eax, xmm1
        bsf      eax, eax                ; number of bytes of the vector to keep
        jz    @no_match                  ; bsf is weird when input is 0 :(
        neg      rax                     ; go back this far into the all-ones bytes
        movdqu   xmm1, [zeroing_mask + rax]   ; take a window of 16 bytes
        pand     xmm0, xmm1
    @no_match:                          ; all bytes are valid, no masking needed
        ;; XMM0 holds bytes from [CMD_BLOCK], up to but not including the first 0x20.
    

    On Intel Haswell, this should have about 11c latency from the input to PCMPEQB being ready until the output of PAND is ready.

    If you could use TZCNT instead of BSF, you could avoid the branch. Since we want a 16 in the no-match case (so neg eax gives -16, and we load a vector of all-ones), a 16-bit TZCNT will do the trick. (tzcnt ax, ax works, since the upper bytes of RAX are already zero from pmovmskb. Otherwise xor ecx, ecx / tzcnt cx, ax.)
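
    If intrinsics are an option, the whole compare / movemask / sliding-window approach is easy to express in C and you can let the compiler pick the instructions. The following is only an illustrative sketch (the function and table names are made up, and __builtin_ctz is GCC/Clang-specific); ORing in a bit at position 16 plays the role of the 16-bit TZCNT, producing 16 when no separator is found.

        #include <emmintrin.h>   /* SSE2 intrinsics */
        #include <stdalign.h>
        #include <stdint.h>

        /* 16 bytes of 0xFF followed by 16 bytes of 0x00: loading 16 bytes at
           offset (16 - keep) gives a mask whose low `keep` bytes are 0xFF.
           Aligned to 32 so the unaligned window load never splits a cache line,
           like the ALIGN 32 in the asm version. */
        alignas(32) static const uint8_t zeroing_window[32] = {
            0xFF,0xFF,0xFF,0xFF, 0xFF,0xFF,0xFF,0xFF,
            0xFF,0xFF,0xFF,0xFF, 0xFF,0xFF,0xFF,0xFF,
            0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0
        };

        /* Load 16 bytes and zero everything at and after the first 0x20.
           Assumes reading a full 16 bytes from cmd_block is safe. */
        static __m128i load_until_space(const uint8_t *cmd_block)
        {
            __m128i v    = _mm_loadu_si128((const __m128i *)cmd_block);
            __m128i hits = _mm_cmpeq_epi8(v, _mm_set1_epi8(0x20));
            unsigned bitmap = (unsigned)_mm_movemask_epi8(hits);

            /* Index of the first 0x20, or 16 if there is none: the extra bit
               at position 16 keeps ctz well-defined for a zero bitmap. */
            unsigned keep = (unsigned)__builtin_ctz(bitmap | 0x10000u);

            __m128i mask = _mm_loadu_si128(
                (const __m128i *)(zeroing_window + 16 - keep));
            return _mm_and_si128(v, mask);
        }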

    This mask-generation idea with an unaligned load to take a window of some all-ones and all-zeros is the same as one of my answers on Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all.

    There are alternatives to loading the mask from memory. e.g. broadcast the first all-ones byte to all the higher bytes of the vector, doubling the length of the masked region every time until it's big enough to cover the whole vector, even if the 0xFF byte was the first byte.

        movdqu   xmm0, [CMD_BLOCK]
        movdqa   xmm1, [end_pattern]
        pcmpeqb  xmm1, xmm0             ; 0 0 ... -1 ?? ?? ...
    
        movdqa   xmm2, xmm1
        pslldq   xmm2, 1
        por      xmm1, xmm2             ; 0 0 ... -1 -1 ?? ...
    
        movdqa   xmm2, xmm1
        pslldq   xmm2, 2
        por      xmm1, xmm2             ; 0 0 ... -1 -1 -1 -1 ?? ...
    
        pshufd   xmm2, xmm1, 0b10010000  ; [ a b c d ] -> [ a a b c ]
        por      xmm1, xmm2              ; 0 0 ... -1 -1 -1 -1 -1 -1 -1 -1 ?? ... (8-wide)
    
        pshufd   xmm2, xmm1, 0b01000000  ; [ abcd ] -> [ aaab ]
        por      xmm1, xmm2              ; 0 0 ... -1 (all the way to the end, no ?? elements left)
        ;; xmm1 = the same mask the other version loads with movdqu based on the index of the first match
    
        pandn    xmm1, xmm0              ; xmm1 = [CMD_BLOCK] with upper bytes zeroed
    
    
        ;; pshufd instead of copy + vector shift works:
        ;; [ abcd  efgh  hijk  lmno ]
        ;; [ abcd  abcd  efgh  hijk ]  ; we're ORing together so it's ok that the first 4B are still there instead of zeroed.
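
    With intrinsics, the same in-register smear looks like the sketch below (made-up function name again): each step ORs the mask with a copy of itself shifted toward higher bytes, doubling the run of 0xFF bytes until it reaches the top of the vector, and _mm_andnot_si128 then keeps only the bytes below the first match.

        #include <emmintrin.h>   /* SSE2 intrinsics */
        #include <stdint.h>

        /* Mask generated in registers: no lookup table, no branch. */
        static __m128i load_until_space_in_regs(const uint8_t *cmd_block)
        {
            __m128i v    = _mm_loadu_si128((const __m128i *)cmd_block);
            __m128i mask = _mm_cmpeq_epi8(v, _mm_set1_epi8(0x20)); /* 0 0 .. -1 ?? .. */

            /* Smear upward by 1 and 2 bytes with byte shifts, then by 4 and 8
               bytes with pshufd (duplicated low dwords are harmless: we're ORing). */
            mask = _mm_or_si128(mask, _mm_slli_si128(mask, 1));
            mask = _mm_or_si128(mask, _mm_slli_si128(mask, 2));
            mask = _mm_or_si128(mask, _mm_shuffle_epi32(mask, _MM_SHUFFLE(2,1,0,0)));
            mask = _mm_or_si128(mask, _mm_shuffle_epi32(mask, _MM_SHUFFLE(1,0,0,0)));

            /* mask is now -1 at and above the first 0x20 (all-zero if none found);
               andnot keeps the bytes below it. */
            return _mm_andnot_si128(mask, v);
        }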
        
    

    SSE4.2 PCMPISTRM:

    If you XOR with your terminator so that 0x20 bytes become 0x00 bytes, you might be able to use the SSE4.2 string instructions, since they're already set up to handle implicit-length strings where all bytes beyond a 0x00 are invalid. See this tutorial/example, because Intel's documentation just documents everything in full detail, without focusing on the important stuff first.

    PCMPISTRM runs with 9 cycle latency on Skylake, 10c latency on Haswell, and 7c latency on Nehalem. So it's about break-even for latency on Haswell, or actually a loss since we also need a PXOR. Looking for 0x00 bytes and marking elements beyond that is hard-coded, so we need an XOR to turn our 0x20 bytes into 0x00. But it's a lot fewer uops, and less code-size.

    ;; PCMPISTRM imm8:
    ;; imm8[1:0] = 00 = unsigned bytes
    ;; imm8[3:2] = 10 = equals each, vertical comparison.  (always not-equal since we're comparing the orig vector with one where we XORed the match byte)
    ;; imm8[5:4] = 11 = masked(-): inverted for valid bytes, but not for invalid  (TODO: get the logic on this and PAND vs. PANDN correct)
    ;; imm8[6] = 1 = output selection (byte mask, not bit mask)
    ;; imm8[7] = 0 (reserved.  Holy crap, this instruction has room to encode even more functionality??)
    
    movdqu     xmm1, [CMD_BLOCK]
    
    movdqa     xmm2, xmm1
    pxor       xmm2, [end_pattern]       ; turn the stop-character into 0x00 so it looks like an implicit-length string
                                         ; also creating a vector where every byte is different from xmm1, so we get guaranteed results for the "valid" part of the vectors (unless the input string can contain 0x0 bytes)
    pcmpistrm  xmm1, xmm2, 0b01111000    ; implicit destination operand: XMM0
    pand       xmm0, xmm1
    

    I probably don't have the exact args to pcmpistrm correct, but I don't have time to test it or mentally verify it. Suffice it to say, I'm pretty sure it's possible to get it to make a mask that's all-ones before the first zero byte and all-zeros from there on.
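
    For reference, the same guess expressed with intrinsics; it inherits the caveat above, since the mode flags are just the named-constant spelling of that 0b01111000 immediate and are untested:

        #include <nmmintrin.h>   /* SSE4.2 intrinsics */
        #include <stdint.h>

        static __m128i load_until_space_sse42(const uint8_t *cmd_block)
        {
            __m128i v = _mm_loadu_si128((const __m128i *)cmd_block);
            /* Turn the 0x20 separator into 0x00 so the data looks like an
               implicit-length string. */
            __m128i s = _mm_xor_si128(v, _mm_set1_epi8(0x20));

            /* Same bits as the 0b01111000 immediate in the asm above. */
            __m128i mask = _mm_cmpistrm(v, s,
                _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                _SIDD_MASKED_NEGATIVE_POLARITY | _SIDD_UNIT_MASK);

            return _mm_and_si128(v, mask);
        }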