Operands for VPCMPB

I see in the Intel intrinsics guide that you can use vpcmpb without an immediate to get an equality comparison: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512&expand=6816,804,804,4867,351,804,4222,914&text=vpcmpb

I tried to write the following assembly instruction: vpcmpb %zmm30, %zmm0, %k1 (GAS / AT&T syntax, as used by g++), which should compare zmm30 and zmm0 for equality and write the result to k1. However, the assembler complains about the wrong number of operands. What is going on here?

Solution

There are 3 valid machine opcodes for doing this:

vpcmpeqb k, zmm, zmm
(EVEX form of the MMX/SSE2/AVX2 66 0F 74 opcode for [v]pcmpeqb [xy]mm, [xy]mm. These have never taken an immediate; only the eq and signed gt predicates are available, as separate opcodes.)
vpcmpb or vpcmpub with immediate 0
(new instructions that only have EVEX forms, EVEX.512.66.0F3A.W0 3F or 3E).

In asm source, assemblers let you write vpcmpleb k, zmm, zmm, for example, as a more meaningful spelling of vpcmpb k, zmm, zmm, 2, as recommended in Table 5-17 in Intel's vol.2 manual. The predicate becomes part of the mnemonic and implies the immediate.

That table includes a line for VPCMPEQ* reg1, reg2, reg3 -> VPCMP* reg1, reg2, reg3, 0, but the shorter no-immediate form takes precedence for vpcmpeqb k, zmm, zmm in actual assemblers.
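As a rough illustration of how the immediate selects the predicate, here is a scalar C model of the 64-byte compares. The names vpcmp_bytes_model, pred_holds and the is_unsigned flag are made up for this sketch, and it assumes the usual two's-complement conversion to int8_t. The predicate numbering (0 = EQ, 1 = LT, 2 = LE, 3 = always false, 4 = NE, 5 = NLT, 6 = NLE, 7 = always true) is as listed in Intel's vol.2 entry for VPCMPB/VPCMPUB; check it against the manual before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* Comparison predicates selected by the imm8 of vpcmpb / vpcmpub. */
enum vpcmp_pred {
    PRED_EQ = 0, PRED_LT = 1, PRED_LE = 2, PRED_FALSE = 3,
    PRED_NE = 4, PRED_NLT = 5, PRED_NLE = 6, PRED_TRUE = 7
};

/* Evaluate one predicate on a pair of already sign- or zero-extended bytes. */
static int pred_holds(unsigned pred, int x, int y)
{
    switch (pred) {
    case PRED_EQ:    return x == y;
    case PRED_LT:    return x <  y;
    case PRED_LE:    return x <= y;
    case PRED_FALSE: return 0;
    case PRED_NE:    return x != y;
    case PRED_NLT:   return x >= y;
    case PRED_NLE:   return x >  y;
    default:         return 1;   /* PRED_TRUE */
    }
}

/*
 * Hypothetical scalar model of "vpcmpb k, src1, src2, pred" (is_unsigned = 0)
 * or "vpcmpub k, src1, src2, pred" (is_unsigned = 1), Intel operand order.
 * Bit i of the result is set when src1[i] OP src2[i] holds.
 */
uint64_t vpcmp_bytes_model(const uint8_t src1[64], const uint8_t src2[64],
                           unsigned pred, int is_unsigned)
{
    assert(pred <= 7);
    uint64_t mask = 0;
    for (int i = 0; i < 64; i++) {
        /* Assumes two's-complement wraparound when converting to int8_t. */
        int x = is_unsigned ? (int)src1[i] : (int)(int8_t)src1[i];
        int y = is_unsigned ? (int)src2[i] : (int)(int8_t)src2[i];
        if (pred_holds(pred, x, y))
            mask |= (uint64_t)1 << i;
    }
    return mask;
}
```

With pred = PRED_EQ the signed and unsigned variants always agree, which is why vpcmpeqb, vpcmpb ..., 0 and vpcmpub ..., 0 are interchangeable for equality. They only diverge for the ordering predicates, for example when one byte is 0x80 and the other is 0x01.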

Here is NASM source mixed with objdump -S -drwC -Mintel disassembly (assembling with gas .intel_syntax noprefix gives the same results):

                                vpcmpeqb k1, zmm0, zmm1
   0:   62 f1 7d 48 74 c9       vpcmpeqb k1,zmm0,zmm1    # 74 opcode

                                vpcmpb k1, zmm0, zmm1, 0
   6:   62 f3 7d 48 3f c9 00    vpcmpeqb k1,zmm0,zmm1    # 3f opcode

                                vpcmpequb k1, zmm0, zmm1
   d:   62 f3 7d 48 3e c9 00    vpcmpequb k1,zmm0,zmm1   # 3e opcode

                                vpcmpub k1, zmm0, zmm1, 0
  14:   62 f3 7d 48 3e c9 00    vpcmpequb k1,zmm0,zmm1   # 3e opcode

Interestingly, NASM/GAS will assemble vpcmpb k1, zmm0, zmm1, 0 as written, to the form with the immediate. But objdump disassembles that back into vpcmpeqb k1,zmm0,zmm1, the same text as for the no-immediate opcode, so this is one of the cases where a disassemble/reassemble round trip changes the machine code (but not the architectural effect of the instruction, of course).
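To make the difference between the two encodings concrete, here is a small C sketch that rebuilds the byte sequences from the listing above. The function encode_vpcmp_zmm and the MAP_* constants are hypothetical names for this example, and the encoder is deliberately restricted: register-register form only, 512-bit vector length, no masking, and register numbers 0 to 7 so that all the inverted extension bits in the EVEX prefix stay set. It could not encode the zmm30 from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode map selector in the low bits of the first EVEX payload byte. */
enum { MAP_0F = 1, MAP_0F3A = 3 };

/*
 * Hypothetical, heavily restricted EVEX encoder for
 *   op k_dst, zmm_src1, zmm_src2 [, imm8]      (Intel operand order)
 * Assumes: 66 prefix (pp = 01), W = 0, 512-bit length, no mask register,
 * no broadcast, and all register numbers below 8.
 * Returns the number of bytes written to out (6 or 7).
 */
size_t encode_vpcmp_zmm(uint8_t out[7], unsigned map, uint8_t opcode,
                        unsigned k_dst, unsigned zmm_src1, unsigned zmm_src2,
                        int has_imm8, uint8_t imm8)
{
    assert(map == MAP_0F || map == MAP_0F3A);
    assert(k_dst < 8 && zmm_src1 < 8 && zmm_src2 < 8);

    size_t n = 0;
    out[n++] = 0x62;                          /* EVEX escape byte            */
    out[n++] = (uint8_t)(0xF0 | map);         /* R, X, B, R' = 1 (inverted)  */
    out[n++] = (uint8_t)(((~zmm_src1 & 0xF) << 3) /* vvvv, stored inverted   */
                         | 0x04               /* fixed bit                   */
                         | 0x01);             /* pp = 01 (66 prefix), W = 0  */
    out[n++] = 0x48;                          /* L'L = 10 (512-bit), V' = 1  */
    out[n++] = opcode;
    out[n++] = (uint8_t)(0xC0 | (k_dst << 3) | zmm_src2); /* ModRM, mod = 11 */
    if (has_imm8)
        out[n++] = imm8;                      /* predicate for 3F / 3E       */
    return n;
}
```

Calling it with (MAP_0F, 0x74, 1, 0, 1, no immediate) should reproduce 62 f1 7d 48 74 c9, and with (MAP_0F3A, 0x3F, 1, 0, 1, immediate 0) should reproduce 62 f3 7d 48 3f c9 00. The two differ only in the map bits, the opcode byte and the trailing immediate.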

NASM and GAS don't optimize vpcmpequb into vpcmpeqb for you, so prefer vpcmpeqb when comparing integers for equality: signedness makes no difference to an equality test, and the no-immediate 74 encoding is one byte shorter.

Errors in the intrinsics guide

If you're writing in asm, look at the asm reference manual (HTML extract https://www.felixcloutier.com/x86/vpcmpb:vpcmpub or the original Intel PDFs it's scraped from), not the intrinsics guide. That applies especially when you run into any mystery or disagreement between what a document says and what tools and/or CPUs seem to be doing!

The intrinsics guide is certainly known to have errors (although they do get fixed as people report them on Intel's forums). Errors are especially likely in the parts that don't matter for correct use of the C/C++ intrinsics, such as the asm mnemonic shown for each intrinsic.

It's not impossible for Intel's asm manuals to have errors, too, but not anything as major as leaving out an entire machine opcode form of an instruction for an already-released instruction set.

In no way is vpcmpb k, zmm, zmm ever valid without an explicit immediate, in real asm source or as a description of machine code, so yes, this is definitely an error in the intrinsics guide.

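The vpcmpeqb %zmm, %zmm, %k asm syntax with a reversed operand list and $immediate is "AT&T syntax". It happens to be the one GAS uses by default for .s / .S files, but you can switch with .intel_syntax noprefix. In AT&T syntax the immediate comes first, so the instruction from the question should be written either as vpcmpeqb %zmm30, %zmm0, %k1 or as vpcmpb $0, %zmm30, %zmm0, %k1.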

It normally doesn't make sense to use inline asm for single instructions - compilers normally do a good enough job with intrinsics, although perhaps not always for AVX-512 mask stuff.
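For comparison, a minimal sketch of the intrinsics route in C, assuming GCC or clang on x86-64. The wrapper name eq_mask_64bytes and the scalar fallback are inventions of this example; _mm512_loadu_si512 and _mm512_cmpeq_epi8_mask are the real AVX-512F / AVX-512BW intrinsics, and __builtin_cpu_supports is a GCC/clang builtin. The fallback is only there so the code still runs on machines without AVX-512BW.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_PATH 1

/* Compiled for AVX-512BW regardless of the command-line -m flags. */
__attribute__((target("avx512bw")))
static uint64_t eq_mask_avx512(const uint8_t *a, const uint8_t *b)
{
    __m512i va = _mm512_loadu_si512(a);   /* unaligned 64-byte loads */
    __m512i vb = _mm512_loadu_si512(b);
    /* The compiler picks the encoding; expect a vpcmpeqb into a k register. */
    return (uint64_t)_mm512_cmpeq_epi8_mask(va, vb);
}
#endif

/* Portable fallback: bit i is set when a[i] == b[i]. */
static uint64_t eq_mask_scalar(const uint8_t *a, const uint8_t *b)
{
    uint64_t mask = 0;
    for (int i = 0; i < 64; i++)
        if (a[i] == b[i])
            mask |= (uint64_t)1 << i;
    return mask;
}

/* Hypothetical wrapper: compare 64 bytes for equality, one mask bit per byte. */
uint64_t eq_mask_64bytes(const uint8_t *a, const uint8_t *b)
{
    assert(a != NULL && b != NULL);
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512bw"))
        return eq_mask_avx512(a, b);
#endif
    return eq_mask_scalar(a, b);
}
```

In code that is already built with -mavx512bw, the target attribute and the runtime check are unnecessary and the body reduces to the single _mm512_cmpeq_epi8_mask call.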