I'm new to assembly code and SSE/AVX instructions. Now, I want to assign a specific value to all locations in 256-bit YMM registers, but I don't know if the final result is correct.
ymm0
:__asm__ __volatile__(
"vpxor %%ymm0, %%ymm0, %%ymm0\n\t" // all are 0
or
"VPCMPEQB %%ymm0, %%ymm0, %%ymm0\n\t" // all are 1
: : :);
GDB result shows that:
// all are 0
ymm0
{v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
v4_double = {0x0, 0x0, 0x0, 0x0},
v32_int8 = {0x0 <repeats 32 times>},
v16_int16 = {0x0 <repeats 16 times>},
v8_int32 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
v4_int64 = {0x0, 0x0, 0x0, 0x0},
v2_int128 = {0x0, 0x0}}
// all are 1
ymm0
{v8_float = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff},
v4_double = {0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff},
v32_int8 = {0xff <repeats 32 times>},
v16_int16 = {0xffff <repeats 16 times>},
v8_int32 = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff},
v4_int64 = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff},
v2_int128 = {0xffffffffffffffffffffffffffffffff, 0xffffffffffffffffffffffffffffffff}}
ymm0
:__asm__ __volatile__(
"movq $0xaaaaaaaaaaaaaaaa, %%rcx\n"
"vmovq %%rcx, %%xmm0\n"
"vpbroadcastq %%xmm0, %%ymm0\n": : :);
GDB result shows that:
ymm0
{v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
v4_double = {0x0, 0x0, 0x0, 0x0},
v32_int8 = {0xaa <repeats 32 times>},
v16_int16 = {0xaaaa <repeats 16 times>},
v8_int32 = {0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa},
v4_int64 = {0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa},
v2_int128 = {0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}}
Questions:
First of all, your inline asm is broken: missing a "%ymm0"
clobber to tell the compiler you wrote that register. You even used asm("" : : :)
Extended asm syntax to explicitly tell the compiler there were no clobbers. Or better, https://gcc.gnu.org/wiki/DontUseInlineAsm - write a separate function, or use intrinsics and look at compiler-generated asm.
v8_float
means to interpret the 256 bits as a Vector of 8x float
. i.e. __m256
in Intel Intrinsics.
v32_int8
is a vector of 32x int8_t
, printing each byte separately. You can use p /x $ymm0.v8_int32
if that's how you want to look at it.
(2) Integer 0xa
is the bit-pattern for a very tiny subnormal float or double. Try putting that in as the "Hexadecimal Representation" on https://www.h-schmidt.net/FloatConverter/IEEE754.html.
(3) You already did broadcast 0xa
to all 64 nibbles in your 32-byte YMM register, as you can see from the v2_int128 = {0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}}
output showing both halves being all 0xaa
bytes.
If you actually wanted _mm256_set1_epi8(0x0a)
(broadcast that to every byte), you should have written 0x0a0a0a0a
instead of 0xaaaaaaaa
. (There's no need to use a qword immediate; vpbroadcastd
runs just as fast, but mov $0x0a0a0a0a, %eax
is a smaller and faster instruction.)
https://godbolt.org/z/z18nMT3fd shows GCC and clang compiling that a function that returns _mm256_set1_epi8(0x0a)
(and another that broadcasts a function arg, not a constant). GCC11.3 does constant-propagation and loads 32 bytes from .rodata
. GCC12.1 unwisely uses your strategy of mov r64, imm64 and vmovq
.
Clang uses vbroadcastsd
(which is the same thing as vpbroadcastq
) from an 8-byte memory source. 4-byte broadcast-loads are just as efficient. (Unlike byte or word which cost an extra ALU uop: https://uops.info/ and https://agner.org/optimize/)
AVX-512 introduces vpbroadcastb/w/d/q ymm0, eax
which combines the vmovd
with the broadcast. But without that, yeah you generally want AVX2 vpbroadcastb/w/d/q ymm, xmm
if data is coming from an integer register. (I'm using Intel syntax here, like the vendor manuals; reverse it as usual for AT&T syntax if you prefer that.)
AFAIK, there isn't a good trick to generate 0xa (0b1010) on the fly from all-ones. Some other constants like 0x1 or 0x8000000 can be generated with 2 instructions, starting with vpcmpeqd same,same,same
for all-ones. (See What are the best instruction sequences to generate vector constants on the fly?)