assembly x86 x86-64 masm avx512

How to write an operand that is a 512-bit vector loaded from an N-bit memory location in x86 Assembly


The source Intel manual is here: https://cdrdv2.intel.com/v1/dl/getContent/671110

The operands are specified as m32bcst or m64bcst.

Example of an instruction that has a variant that uses this operand

I am interested in writing the variant of the instruction that uses this operand in actual Assembly.

If the operand were m32 instead of m32bcst, then in MASM, for instance, one could write: VMINPS YMM1{k1}{z}, YMM2, DWORD PTR[EAX]

I am not sure what to do in case of an m32bcst operand however.


Solution

  • It varies by assembler. Some support the {1to16} / {1to8} / {1to4} syntax shown in slides from a 2014 talk introducing AVX-512 at a GCC conference, by Kirill Yukhin of Intel. (Despite it being a GCC talk, the slides use Intel syntax.) Others support that and/or something else:

    • MASM: vminps zmm1, zmm2, DWORD bcst [rax]

    • NASM: vminps zmm1, zmm2, [rax]{1to16} (with an optional dword or qword specifier in the usual place, like dword [rax]{1to16}). NASM does not support the bcst keyword.

    • €ASM aka Euro Assembler: vminps ymm1,ymm2,[rax],Bcst=on

    • GAS/clang .intel_syntax is MASM-like in general and supports dword bcst [rax], as well as [rax]{1to16}. (objdump -drwC -Mintel prints dword bcst [rax].)

    • AT&T syntax: vminps (%rax){1to16},%zmm2,%zmm1
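
    As a minimal sketch of how a broadcast operand fits into a complete routine, the following uses the NASM spelling from the list above. The function name clamp_max16, the label limit, and the System V AMD64 calling convention are assumptions made for illustration only; running it needs a CPU with AVX-512F.

    ```nasm
    ; NASM syntax, x86-64. One way to assemble: nasm -f elf64 bcst.asm
    ; clamp_max16 and limit are hypothetical names used only in this sketch.
    default rel

    section .rodata
    limit:  dd 1.0                          ; a single 32-bit float in memory

    section .text
    global clamp_max16

    ; void clamp_max16(float *dst /* rdi */, const float *src /* rsi */)
    ; Stores min(src[i], 1.0f) to dst[i] for i = 0..15 (NaN handling follows vminps).
    clamp_max16:
        vmovups zmm0, [rsi]                 ; load 16 floats (unaligned is fine)
        vminps  zmm0, zmm0, [limit]{1to16}  ; m32bcst: one dword replicated to all 16 lanes
        vmovups [rdi], zmm0                 ; store 16 results
        vzeroupper                          ; avoid transition penalties in legacy-SSE callers
        ret
    ```

    With MASM, the same instruction would presumably be spelled with DWORD BCST in place of {1to16}, as in the MASM bullet above.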

    The machine code only has 1 bit to encode broadcast vs. regular, so there's no way to broadcast 64-bit pairs of floats for vminps; the broadcast element size has to match the SIMD element size. So €ASM's minimal syntax is sufficient; the others merely provide a way for the assembler to check for a mismatch in what the human thinks the instruction will do.
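
    To make the element-size rule concrete, here is a short sketch in NASM syntax; it assumes rax points at the scalar to be broadcast. The replication count is simply the vector width divided by the instruction's element size, so it carries no information beyond the single broadcast bit.

    ```nasm
    bits 64
    ; rax is assumed to hold the address of the scalar to broadcast.
    vminps zmm1, zmm2, [rax]{1to16}   ; ps: 32-bit elements, 512/32 = 16 copies (m32bcst)
    vminpd zmm1, zmm2, [rax]{1to8}    ; pd: 64-bit elements, 512/64 = 8 copies (m64bcst)

    ; A count that disagrees with the element size and vector width, for example
    ; {1to8} on the zmm form of vminps, has no encoding and is expected to be
    ; rejected by the assembler:
    ; vminps zmm1, zmm2, [rax]{1to8}
    ```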

    Unlike embedded rounding + suppress-all-exceptions, which only work with scalar (like vmulss) or 512-bit instructions (see Footnote 1), broadcast memory operands also work with 256-bit and 128-bit vectors (AVX512VL).
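
    A sketch of the same broadcast at each vector width, again in NASM syntax and assuming rax holds the address of a single float. The last line adds zero-masking to mirror the masked ymm form from the question.

    ```nasm
    bits 64
    ; The zmm form needs AVX-512F; the xmm and ymm forms additionally need AVX512VL.
    vminps xmm1, xmm2, [rax]{1to4}          ; 128-bit vector: 4 copies of one dword
    vminps ymm1, ymm2, [rax]{1to8}          ; 256-bit vector: 8 copies
    vminps zmm1, zmm2, [rax]{1to16}         ; 512-bit vector: 16 copies
    vminps ymm1{k1}{z}, ymm2, [rax]{1to8}   ; broadcast combined with zero-masking by k1
    ```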

    Broadcast element sizes of 32 and 64 bits are supported; not coincidentally, those are the element sizes that load ports on Intel CPUs can broadcast for free as part of a load uop. (Note that vpbroadcastb/w vec, [mem] needs an ALU uop, whereas vpbroadcastd/q needs only the load uop.)


    Footnote 1: e.g. vmulps zmm0,zmm1,zmm2{rz-sae} (GAS .intel_syntax / MASM)
    or vmulps zmm0, zmm1, zmm2, {rz-sae} (NASM, with an extra comma before the {})