How do AVX512 rounding modes work (or is NDISASM simply confused)?

I’m trying to understand the specific AVX512F instruction vcvtps2udq.

The signature of the instruction is VCVTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst{er}. The manual info is below.

In an attempt to understand the new rounding modes, I assembled the following snippet with NASM (2.12.02):

vcvtps2udq zmm0,zmm1
vcvtps2udq zmm0,zmm1,{rz-sae}
vcvtps2udq xmm0,xmm1

Disassembling the result with NDISASM (2.12.02) gives the following confusing output:

62F17C4879C1      vcvtps2udq zmm0,zmm1
62F17C7879C1      vcvtps2udq xmm0,xmm1
62F17C0879C1      vcvtps2udq xmm0,xmm1

Question: the second line is disassembled with xmm registers instead of the zmm registers I would have expected. Does the round-toward-zero mode (rz-sae) have something to do with this? Or is NDISASM simply wrong, unable to distinguish between the encodings 62F17C7879C1 and 62F17C0879C1?

The Intel instruction set reference manual has the following description:

Converts sixteen packed single-precision floating-point values in the source operand to sixteen unsigned doubleword integers in the destination operand.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register or the embedded rounding control bits. If a converted result cannot be represented in the destination format, the floating-point invalid exception is raised, and if this exception is masked, the integer value 2^w - 1 is returned, where w represents the number of bits in the destination format.

The source operand is a ZMM/YMM/XMM register, a 512/256/128-bit memory location, or a 512/256/128-bit vector broadcasted from a 32-bit memory location. The destination operand is a ZMM/YMM/XMM register conditionally updated with writemask k1.

Solution

The instructions are encoded with an EVEX prefix, 0x62 P0 P1 P2, followed by the opcode and ModRM bytes (see here, section 4.2). In this case, the P2 bytes are:

P2
48  <- vcvtps2udq zmm0,zmm1
78  <- vcvtps2udq zmm0,zmm1,{rz-sae}
08  <- vcvtps2udq xmm0,xmm1
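
As a minimal sketch, the P2 values above can be pulled out of the encoded bytes like this. The function name split_evex is made up for illustration, and the code assumes the input is already known to be a 6-byte EVEX instruction (no legacy prefixes, no SIB byte or displacement), as in the three encodings from the question.

```python
def split_evex(hex_bytes: str) -> dict:
    """Split a 6-byte EVEX instruction into its component bytes.

    Hypothetical helper: it only handles the simple layout
    0x62 P0 P1 P2 opcode ModRM and does not try to be a real decoder.
    """
    raw = bytes.fromhex(hex_bytes)
    if len(raw) != 6 or raw[0] != 0x62:
        raise ValueError("expected 0x62 P0 P1 P2 opcode ModRM")
    return {
        "p0": raw[1],
        "p1": raw[2],
        "p2": raw[3],
        "opcode": raw[4],
        "modrm": raw[5],
    }


# The three encodings produced by NASM in the question.
for code in ("62F17C4879C1", "62F17C7879C1", "62F17C0879C1"):
    parts = split_evex(code)
    print(code, "P2 = %02X" % parts["p2"])
```

Only P2 differs between the three encodings; P0, P1, the opcode (0x79) and the ModRM byte (0xC1) are identical.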

Breaking that down further, each P2 byte contains the following fields:

                       zmm  zmm+sae  xmm
EVEX.aaa  = P2[2:0]     0     0       0
EVEX.V'   = P2[3]       1     1       1
EVEX.b    = P2[4]       0     1       0  "Broadcast/RC/SAE Context"
EVEX.L'L  = P2[6:5]     2     3       0  "Vector length/RC"
EVEX.z    = P2[7]       0     0       0

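So the fields that differ are EVEX.b and EVEX.L'L. According to the docs, if b is not set, then L'L is the vector length, so 0 = xmm and 2 = zmm. If b is set (and the source operand is a register), L'L is reinterpreted as the static rounding mode and the vector length is assumed to be zmm (512 bits). Here L'L = 3 selects round toward zero, which is exactly the {rz-sae} that was requested.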
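
The rule above can be sketched as a small P2 decoder. This is an illustration rather than a full EVEX decoder: the names decode_p2 and describe are made up, and the code assumes a register-to-register form of an instruction that supports embedded rounding (with a memory operand, EVEX.b means broadcast instead). The rounding-mode table follows the encoding 0 = rn, 1 = rd, 2 = ru, 3 = rz.

```python
# Meaning of L'L when EVEX.b = 1 (register source): static rounding mode.
ROUNDING = {0: "rn-sae", 1: "rd-sae", 2: "ru-sae", 3: "rz-sae"}
# Meaning of L'L when EVEX.b = 0: vector length. The value 3 is reserved.
LENGTH = {0: "xmm", 1: "ymm", 2: "zmm"}


def decode_p2(p2: int) -> dict:
    """Extract the raw bit fields of the EVEX P2 byte."""
    return {
        "aaa": p2 & 0b111,        # P2[2:0], opmask register
        "V'": (p2 >> 3) & 1,      # P2[3], raw bit as stored in the encoding
        "b": (p2 >> 4) & 1,       # P2[4], broadcast / RC / SAE context
        "L'L": (p2 >> 5) & 0b11,  # P2[6:5], vector length or rounding control
        "z": (p2 >> 7) & 1,       # P2[7], zeroing-masking
    }


def describe(p2: int) -> str:
    """Interpret L'L depending on b (register-register form assumed)."""
    fields = decode_p2(p2)
    if fields["b"]:
        # b set: L'L is the rounding mode, length is implied to be 512 bits.
        return "zmm, " + ROUNDING[fields["L'L"]]
    return LENGTH.get(fields["L'L"], "reserved")


for p2 in (0x48, 0x78, 0x08):
    print("%02X -> %s" % (p2, describe(p2)))
```

With the three P2 bytes from the question this gives zmm for 48, zmm with rz-sae for 78, and xmm for 08, which matches what NASM was asked to assemble.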

NDISASM is not interpreting the EVEX.b bit correctly, and therefore misreads the EVEX.L'L field as well.