Set an XMM register to a repeating byte pattern (broadcast a constant byte)

I know that we can do something like this to load a character into an XMM register:

    movaps xmm1, xword [.__0x20]

    align 16
    .__0x20 db 0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20,0x20

But since this requires a memory access, I want to know if there is a better way. (Also, I'm talking about SSE2 only, not other SIMD extensions.)

I want every byte of the xmm1 register to be 0x20, not just one byte.

(Editor's note: this can be called a broadcast or splat.
It's what the _mm_set1_epi8(0x20) intrinsic does.)

Solution

With only SSE2, loading the full pattern from memory is generally your best bet.

In your NASM source you can use times 16 db 0x20 for easy maintainability.
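
For readers working in C rather than NASM, the same full-width load can be sketched with SSE2 intrinsics. This is a minimal sketch, not the asker's code: the names k_spaces, load_spaces and check_load_spaces are made up for illustration, and an optimizing compiler is free to pick its own instruction for the load.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical name: a 16-byte-aligned constant, the C analogue of
   "align 16" followed by "times 16 db 0x20". */
static _Alignas(16) const uint8_t k_spaces[16] = {
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20
};

/* Aligned 16-byte load of the whole pattern from memory. */
static inline __m128i load_spaces(void) {
    return _mm_load_si128((const __m128i *)k_spaces);
}

/* Usage check: the explicit load and _mm_set1_epi8 yield the same bytes. */
static void check_load_spaces(void) {
    uint8_t a[16], b[16];
    _mm_storeu_si128((__m128i *)a, load_spaces());
    _mm_storeu_si128((__m128i *)b, _mm_set1_epi8(0x20));
    assert(memcmp(a, b, 16) == 0);
    assert(a[0] == 0x20 && a[15] == 0x20);
}
```

In practice _mm_set1_epi8(0x20) alone is usually enough, since compilers typically turn it into this kind of 16-byte constant load anyway.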

With SSE3 you can do 8-byte broadcast loads with movddup. With AVX you can do a 4-byte broadcast-load with vbroadcastss. These broadcast-loads are very good on modern CPUs, running on just the load port, not needing a shuffle uop. i.e. they're exactly as cheap as movaps on CPUs that support them, except for a byte or two more code-size. Same for vbroadcastf128 to YMM registers.

Most compilers don't seem to realize this: they do constant propagation through _mm_set1 even when that produces a 32-byte constant instead of a 4-byte one, and even when the constant is just loaded with a mov... instruction ahead of a loop rather than folded into a memory operand for an ALU instruction. (Folding is still possible with a broadcast load when AVX512 is available.) Clang does sometimes take advantage of broadcast loads for simple constants.

AVX2 adds vpbroadcastb/w/d/q, but only dword and qword are pure load uops. Byte and word broadcast-loads need an ALU shuffle uop, so for constant byte patterns you probably want to just broadcast-load a dword that repeats a byte 4 times. (Unless it's an element from a big lookup table, then compress the table by using a byte or word broadcast load, or a pmovsx sign-extending load or whatever).

AVX512 adds vpbroadcastb/w/d/q from an integer register, so you could mov eax, 0x20202020 / vpbroadcastd xmm0, eax if you have AVX512VL.

With SSE2 it would take at least 2 instructions including an ALU shuffle, like this, and may not be worth it.

    movd    xmm0, [const_4B]
    pshufd  xmm0, xmm0, 0
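
The same two-step idea can be sketched in C with SSE2 intrinsics. The helper name splat_byte_sse2 is hypothetical. Unlike the asm above, the sketch builds the 4-byte pattern in an integer register (multiplying the byte by 0x01010101 repeats it four times) instead of loading it from const_4B. With a compile-time constant argument, a compiler may well fold the whole thing back into a 16-byte load.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: splat one byte across an XMM register, SSE2 only. */
static inline __m128i splat_byte_sse2(uint8_t b) {
    uint32_t dword = 0x01010101u * b;           /* b repeated in all 4 bytes */
    __m128i v = _mm_cvtsi32_si128((int)dword);  /* movd: low dword = pattern, rest zero */
    return _mm_shuffle_epi32(v, 0);             /* pshufd imm8=0: copy dword 0 to all 4 lanes */
}

/* Usage check: every byte of the result equals the input byte. */
static void check_splat_byte_sse2(void) {
    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, splat_byte_sse2(0x20));
    for (int i = 0; i < 16; i++)
        assert(out[i] == 0x20);
}
```

This form is mainly useful when the byte is only known at run time; for a fixed 0x20 the plain load remains the simpler choice.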

Some repeating constants can be generated on the fly in a couple of instructions, starting with all-ones from pcmpeqd xmm0,xmm0. See "What are the best instruction sequences to generate vector constants on the fly?" and Agner Fog's guide.

This pattern does not appear to be easy to generate. It's a byte pattern (not word, dword, or qword) and SSE shifts are only available with word granularity at best. However, if we know the bits shifted across byte boundaries are 0, it's fine. e.g.

    pcmpeqd  xmm0, xmm0     ; set1( -1 )
    pabsb    xmm0, xmm0     ; set1_epi8(1)    SSSE3
    pslld    xmm0, 5        ; set1_epi8(1<<5)

    ; or with only SSE2, something even less efficient like shift / packsswb / shift

This is unlikely to be worth it unless you really want to avoid the possibility of a cache miss for the constant. On average, a load will come out ahead.
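
For completeness, here is one possible reading of the SSE2-only "shift / packsswb / shift" sequence, sketched with intrinsics. The function name spaces_on_the_fly_sse2 is hypothetical and the exact sequence is an assumption, not something spelled out above. In asm, pcmpeqd xmm0,xmm0 needs no initialized input, whereas the C version compares a zeroed register with itself. An optimizing compiler will likely constant-fold all of this into a load.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: build set1_epi8(0x20) without a memory constant. */
static inline __m128i spaces_on_the_fly_sse2(void) {
    __m128i z    = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(z, z);    /* pcmpeqd:  all bits set */
    __m128i w1   = _mm_srli_epi16(ones, 15); /* psrlw 15: each word = 0x0001 */
    __m128i b1   = _mm_packs_epi16(w1, w1);  /* packsswb: each byte = 0x01 */
    return _mm_slli_epi32(b1, 5);            /* pslld 5:  each byte = 0x20; no set
                                                bits cross a byte boundary */
}

/* Usage check: the generated register matches _mm_set1_epi8(0x20). */
static void check_spaces_on_the_fly(void) {
    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, spaces_on_the_fly_sse2());
    for (int i = 0; i < 16; i++)
        assert(out[i] == 0x20);
}
```

That is four ALU instructions against a single load, which illustrates why the load usually wins.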