How to set all the values in AVX ymm register to be the same (all are 0/1/specific value)?

I'm new to assembly code and SSE/AVX instructions. Now, I want to assign a specific value to all locations in 256-bit YMM registers, but I don't know if the final result is correct.

To assign 0 or 1 to ymm0:

__asm__ __volatile__(
    "vpxor    %%ymm0, %%ymm0, %%ymm0\n\t" // all are 0
    or
    "VPCMPEQB    %%ymm0, %%ymm0, %%ymm0\n\t" // all are 1
: : :);

GDB result shows that:

// all are 0
ymm0
{v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
v4_double = {0x0, 0x0, 0x0, 0x0}, 
v32_int8 = {0x0 <repeats 32 times>}, 
v16_int16 = {0x0 <repeats 16 times>}, 
v8_int32 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
v4_int64 = {0x0, 0x0, 0x0, 0x0}, 
v2_int128 = {0x0, 0x0}}

// all are 1
ymm0           
{v8_float = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff}, 
v4_double = {0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff, 0x7fffffffffffffff}, 
v32_int8 = {0xff <repeats 32 times>}, 
v16_int16 = {0xffff <repeats 16 times>}, 
v8_int32 = {0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff}, 
v4_int64 = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff}, 
v2_int128 = {0xffffffffffffffffffffffffffffffff, 0xffffffffffffffffffffffffffffffff}}

To set 0xA to all locations (both high and low 128-bits) in ymm0:

__asm__ __volatile__(
      "movq $0xaaaaaaaaaaaaaaaa, %%rcx\n"
      "vmovq %%rcx, %%xmm0\n" 
      "vpbroadcastq %%xmm0, %%ymm0\n": : :);

GDB result shows that:

ymm0           
{v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
v4_double = {0x0, 0x0, 0x0, 0x0}, 
v32_int8 = {0xaa <repeats 32 times>}, 
v16_int16 = {0xaaaa <repeats 16 times>}, 
v8_int32 = {0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa, 0xaaaaaaaa}, 
v4_int64 = {0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaa}, 
v2_int128 = {0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}}

Questions:

What does the GDB result (structure) mean? E.g., v8_float, v4_double, v32_int8, etc.
In the second case (0xA), why are the v8_float and v4_double always 0?
How can I assign the value (e.g., 'a') to all locations in YMM (including both high and low 128-bits)?

P.S VPBROADCAST — Load Integer and Broadcast

Solution

First of all, your inline asm is broken: missing a "%ymm0" clobber to tell the compiler you wrote that register. You even used asm("" : : :) Extended asm syntax to explicitly tell the compiler there were no clobbers. Or better, https://gcc.gnu.org/wiki/DontUseInlineAsm - write a separate function, or use intrinsics and look at compiler-generated asm.

v8_float means to interpret the 256 bits as a Vector of 8x float. i.e. __m256 in Intel Intrinsics.

v32_int8 is a vector of 32x int8_t, printing each byte separately. You can use p /x $ymm0.v8_int32 if that's how you want to look at it.

(2) Integer 0xa is the bit-pattern for a very tiny subnormal float or double. Try putting that in as the "Hexadecimal Representation" on https://www.h-schmidt.net/FloatConverter/IEEE754.html.

(3) You already did broadcast 0xa to all 64 nibbles in your 32-byte YMM register, as you can see from the v2_int128 = {0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, 0xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}} output showing both halves being all 0xaa bytes.

If you actually wanted _mm256_set1_epi8(0x0a) (broadcast that to every byte), you should have written 0x0a0a0a0a instead of 0xaaaaaaaa. (There's no need to use a qword immediate; vpbroadcastd runs just as fast, but mov $0x0a0a0a0a, %eax is a smaller and faster instruction.)

https://godbolt.org/z/z18nMT3fd shows GCC and clang compiling that a function that returns _mm256_set1_epi8(0x0a) (and another that broadcasts a function arg, not a constant). GCC11.3 does constant-propagation and loads 32 bytes from .rodata. GCC12.1 unwisely uses your strategy of mov r64, imm64 and vmovq.

Clang uses vbroadcastsd (which is the same thing as vpbroadcastq) from an 8-byte memory source. 4-byte broadcast-loads are just as efficient. (Unlike byte or word which cost an extra ALU uop: https://uops.info/ and https://agner.org/optimize/)

AVX-512 introduces vpbroadcastb/w/d/q ymm0, eax which combines the vmovd with the broadcast. But without that, yeah you generally want AVX2 vpbroadcastb/w/d/q ymm, xmm if data is coming from an integer register. (I'm using Intel syntax here, like the vendor manuals; reverse it as usual for AT&T syntax if you prefer that.)

AFAIK, there isn't a good trick to generate 0xa (0b1010) on the fly from all-ones. Some other constants like 0x1 or 0x8000000 can be generated with 2 instructions, starting with vpcmpeqd same,same,same for all-ones. (See What are the best instruction sequences to generate vector constants on the fly?)