I have the following code, which works fine but seems inefficient given that the end result only requires the data in xmm0:
mov rcx, 16 ; get first word, up to 16 bytes
mov rdi, CMD ; ...and put it in CMD
mov rsi, CMD_BLOCK
@@: lodsb
cmp al, 0x20
je @f
stosb
loop @b
@@: mov rsi, CMD ;
movdqa xmm0, [rsi] ; mov cmd into xmm0
I'm sure using SSE2, SSE4 etc, there's a better way that doesn't require the use of the CMD buffer but I'm struggling to work out how to do it.
Your code looks like it gets bytes from CMD_BLOCK up to the first 0x20, and I assume wants zeros above that.
That's not even close to the most efficient way to write a byte-at-a-time copy loop. Never use the LOOP instruction, unless you're specifically tuning for one of the few architectures where it's not slow (e.g. AMD Bulldozer). See Agner Fog's stuff, and other links from the x86 tag wiki. Or use SSE/AVX via C intrinsics, and let a compiler generate the actual asm.
But more importantly, you don't even need a loop if you use SSE instructions.
I'm assuming you zeroed the 16B CMD buffer before starting the copy, otherwise you might as well just do an unaligned load and grab whatever garbage bytes are there beyond the data you want.
Things are much easier if you can safely read past the end of CMD_BLOCK without causing a segfault. Hopefully you can arrange for that to be safe. e.g. make sure it's not at the very end of a page that's followed by an unmapped page. If not, you might need to do an aligned load, and then conditionally another aligned load if you didn't get the end of the data.
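As a rough sketch of the "not at the very end of a page" condition, here is a small C helper that tells you whether a 16-byte load stays inside one page. The name load16_stays_in_page and the 4 KiB page size are my assumptions for illustration (4 KiB is the smallest x86 page size, so it is the conservative choice).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, the smallest page size on x86. */
#define PAGE_SIZE_ASSUMED 4096u

/* Hypothetical helper: returns true if a 16-byte load starting at p
   stays inside a single page, so it cannot touch an unmapped
   neighbouring page as long as the first byte is readable. */
static bool load16_stays_in_page(const void *p)
{
    uintptr_t offset_in_page = (uintptr_t)p & (PAGE_SIZE_ASSUMED - 1);
    return offset_in_page <= PAGE_SIZE_ASSUMED - 16;
}
```

If this returns false for your buffer, that is the case where you would fall back to aligned loads.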
section .rodata
ALIGN 32 ; No cache-line splits when taking an unaligned 16B window on these 32 bytes
dd -1, -1, -1, -1
zeroing_mask:
dd 0, 0, 0, 0
ALIGN 16
end_pattern: times 16 db 0x20 ; pre-broadcast the byte to compare against (or generate it on the fly)
section .text
... as part of some function ...
movdqu xmm0, [CMD_BLOCK] ; you don't have to waste instructions putting pointers in registers.
movdqa xmm1, [end_pattern] ; or hoist this load out of a loop
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
bsf eax, eax ; index of the first 0x20 (lowest set bit) = number of bytes of the vector to keep
jz @no_match ; bsf is weird when input is 0 :(
neg rax ; go back this far into the all-ones bytes
movdqu xmm1, [zeroing_mask + rax] ; take a window of 16 bytes
pand xmm0, xmm1
@no_match: ; all bytes are valid, no masking needed
;; XMM0 holds bytes from [CMD_BLOCK], up to but not including the first 0x20.
On Intel Haswell, this should have about 11c latency from the input to PCMPEQB being ready until the output of PAND is ready.
If you can use TZCNT (BMI1) in place of the bit-scan instruction, you can avoid the branch. Since we want a 16 in the no-match case (so neg rax gives -16, and we load a vector of all-ones), a 16-bit TZCNT will do the trick, because it returns the operand size for a zero input. (tzcnt ax, ax works, since the upper bytes of RAX are already zero from pmovmskb. Otherwise xor ecx, ecx / tzcnt cx, ax.)
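If you'd rather let a compiler generate the asm, the same mask-window idea could look roughly like this with SSE2 intrinsics. This is a sketch, not tested against your code: the names mask_table and first_word_sse2 are made up for the example, __builtin_ctz and the aligned attribute are GCC/Clang extensions, and it assumes reading 16 bytes from the input is safe. ORing in bit 16 before the bit-scan gives a count of 16 in the no-match case without needing a branch or BMI1.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* 16 bytes of 0xFF followed by 16 bytes of 0x00: the same layout as the
   .rodata table, where zeroing_mask labels the second half. */
static const uint8_t mask_table[32] __attribute__((aligned(32))) = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};

/* Hypothetical helper: returns the 16 bytes at cmd_block with the first
   0x20 byte and everything after it zeroed. */
static __m128i first_word_sse2(const void *cmd_block)
{
    __m128i data  = _mm_loadu_si128((const __m128i *)cmd_block);
    __m128i space = _mm_set1_epi8(0x20);
    unsigned hits = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(data, space));

    /* Index of the first match; bit 16 is a sentinel so that
       "no match" yields 16 (keep every byte). */
    unsigned keep = (unsigned)__builtin_ctz(hits | 0x10000u);

    /* Window of 16 bytes: `keep` bytes of 0xFF, then zeros. */
    __m128i mask = _mm_loadu_si128((const __m128i *)(mask_table + 16 - keep));
    return _mm_and_si128(data, mask);
}
```

Store the result with _mm_storeu_si128 if you need it in memory, otherwise just keep using the __m128i.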
This mask-generation idea with an unaligned load to take a window of some all-ones and all-zeros is the same as one of my answers on "Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all".
There are alternatives to loading the mask from memory. e.g. broadcast the first all-ones byte to all the higher bytes of the vector, doubling the length of the masked region every time until it's big enough to cover the whole vector, even if the 0xFF byte was the first byte.
movdqu xmm0, [CMD_BLOCK]
movdqa xmm1, [end_pattern]
pcmpeqb xmm1, xmm0 ; 0 0 ... -1 ?? ?? ...
movdqa xmm2, xmm1
pslldq xmm2, 1
por xmm1, xmm2 ; 0 0 ... -1 -1 ?? ...
movdqa xmm2, xmm1
pslldq xmm2, 2
por xmm1, xmm2 ; 0 0 ... -1 -1 -1 -1 ?? ...
pshufd xmm2, xmm1, 0b10010000 ; [ a b c d ] -> [ a a b c ]
por xmm1, xmm2 ; 0 0 ... -1 -1 -1 -1 -1 -1 -1 -1 ?? ... (8-wide)
pshufd xmm2, xmm1, 0b01000000 ; [ abcd ] -> [ aaab ]
por xmm1, xmm2 ; 0 0 ... -1 (all the way to the end, no ?? elements left)
;; xmm1 = the inverse of the mask the other version loads with movdqu based on the index of the first match (all-ones from the first match onward), hence pandn instead of pand
pandn xmm1, xmm0 ; xmm1 = [CMD_BLOCK] with upper bytes zeroed
;; pshufd instead of copy + vector shift works:
;; [ abcd efgh ijkl mnop ]
;; [ abcd abcd efgh ijkl ] ; we're ORing together so it's ok that the first 4B are still there instead of zeroed.
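The same doubling sequence maps one-to-one onto SSE2 intrinsics, if you want to let the compiler pick registers. Again just a sketch: first_word_smear is a name I made up, and it assumes a 16-byte read from the input is safe.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical helper: builds the "first match and everything above it"
   mask by ORing in shifted copies, then clears those bytes of the data. */
static __m128i first_word_smear(const void *cmd_block)
{
    __m128i data = _mm_loadu_si128((const __m128i *)cmd_block);
    __m128i m    = _mm_cmpeq_epi8(data, _mm_set1_epi8(0x20));

    m = _mm_or_si128(m, _mm_slli_si128(m, 1));        /* pslldq 1: 2-wide */
    m = _mm_or_si128(m, _mm_slli_si128(m, 2));        /* pslldq 2: 4-wide */
    m = _mm_or_si128(m, _mm_shuffle_epi32(m, 0x90));  /* [a b c d] -> [a a b c]: 8-wide */
    m = _mm_or_si128(m, _mm_shuffle_epi32(m, 0x40));  /* [a b c d] -> [a a a b]: 16-wide */

    /* pandn: (~m) & data keeps only the bytes below the first 0x20. */
    return _mm_andnot_si128(m, data);
}
```

This version needs no table in memory, at the cost of a longer dependency chain of shifts and ORs.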
If you XOR with your terminator so that 0x20 bytes become 0x00 bytes, you might be able to use the SSE4.2 string instructions, since they're already set up to handle implicit-length strings where all bytes beyond a 0x00 are invalid. See this tutorial/example, because Intel's documentation just documents everything in full detail, without focusing on the important stuff first.
PCMPISTRM runs with 9 cycle latency on Skylake, 10c latency on Haswell, and 7c latency on Nehalem. So it's about a break-even for latency on Haswell, or actually a loss since we also need a PXOR. Looking for 0x00 bytes and marking elements beyond that is hard-coded, so we need an XOR to turn our 0x20 bytes into 0x00. But it's a lot fewer uops, and less code-size.
;; PCMPISTRM imm8:
;; imm8[1:0] = 00 = unsigned bytes
;; imm8[3:2] = 10 = equals each, vertical comparison. (always not-equal since we're comparing the orig vector with one where we XORed the match byte)
;; imm8[5:4] = 11 = masked(-): inverted for valid bytes, but not for invalid (TODO: get the logic on this and PAND vs. PANDN correct)
;; imm8[6] = 1 = output selection (byte mask, not bit mask)
;; imm8[7] = 0 (reserved. Holy crap, this instruction has room to encode even more functionality??)
movdqu xmm1, [CMD_BLOCK]
movdqa xmm2, xmm1
pxor xmm2, [end_pattern] ; turn the stop-character into 0x00 so it looks like an implicit-length string
; also creating a vector where every byte is different from xmm1, so we get guaranteed results for the "valid" part of the vectors (unless the input string can contain 0x0 bytes)
pcmpistrm xmm1, xmm2, 0b01111000 ; implicit destination operand: XMM0
pand xmm0, xmm1
I probably don't have the exact args to pcmpistrm correct, but I don't have time to test it or mentally verify it. Suffice it to say, I'm pretty sure it's possible to get it to make a mask that's all-ones before the first zero byte and all-zeros from there on.
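If you want to experiment with the immediate, a C intrinsics version is probably the easiest way to try it. The sketch below uses the same imm8 (0b01111000) spelled with the named constants from nmmintrin.h. Going by my reading of Intel's description of "equal each" with masked negative polarity, it should give all-ones below the first 0x20 and zeros from there on, provided the 16 input bytes contain no 0x00. Treat that as an assumption to check, not a verified result. first_word_pcmpistrm is a made-up name, and the target attribute is a GCC/Clang extension that lets this one function use SSE4.2 without building everything with -msse4.2.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <nmmintrin.h>   /* SSE4.2 string intrinsics */

/* Hypothetical helper. Assumes: a 16-byte read from cmd_block is safe,
   the CPU supports SSE4.2, and the input contains no 0x00 bytes. */
__attribute__((target("sse4.2")))
static __m128i first_word_pcmpistrm(const void *cmd_block)
{
    __m128i data  = _mm_loadu_si128((const __m128i *)cmd_block);

    /* Turn the stop character into 0x00 so the second operand looks like
       an implicit-length string. Every byte also differs from `data`,
       so "equal each" is false wherever both operands are valid. */
    __m128i xored = _mm_xor_si128(data, _mm_set1_epi8(0x20));

    /* imm8 = 0x78: unsigned bytes, equal each, masked negative polarity,
       byte mask output. */
    __m128i mask = _mm_cmpistrm(data, xored,
                                _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                                _SIDD_MASKED_NEGATIVE_POLARITY | _SIDD_UNIT_MASK);

    return _mm_and_si128(data, mask);
}
```

The intrinsic returns the mask as a value, so the implicit XMM0 destination is the compiler's problem rather than yours.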