performance x86 floating-point simd avx512

vgetmantps vs andpd instructions for getting the mantissa of float

For Skylake-X (from Agner Fog's instruction tables):

+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
|      Instruction      |  Operands   | µops fused domain | µops unfused domain | µops each port | Latency | Reciprocal throughput |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+
| VGETMANTPS/PD         | v,v,v       |                 1 |                   1 | p01/05         |       4 | 0.5-1                 |
| AND/ANDN/OR/ XORPS/PD | x,x / y,y,y |                 1 |                   1 | p015           |       1 | 0.33                  |
+-----------------------+-------------+-------------------+---------------------+----------------+---------+-----------------------+

Does that mean using a bitmask and a logical AND to get the mantissa of a float is faster than using the vgetmantps instruction?

How much is the latency for transferring the number from float to int and back to float?

Solution

For implementing log(x), you want the mantissa and exponent as floats, and vgetmantps / vgetexpps are perfect for it; see "Efficient implementation of log2(__m256d) in AVX2". This is what those instructions are for, and they do speed up a fast approximation to log(x). (Plus vgetmantps can normalize the significand to [0.5, 1) or [0.75, 1.5) instead of [1, 2), which is handy for creating input for a polynomial approximation to log(x+1) or whatever. See its docs.)
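To make concrete what that split looks like, here is a minimal scalar sketch in C of the per-lane result you would get from vgetexpps plus vgetmantps with the [1, 2) interval. The names split_float and split_exp_mant are made up for illustration, and the sketch assumes a positive, normal, finite IEEE-754 binary32 input (no handling of zero, denormals, infinity, NaN, or sign). It is a model of the arithmetic, not of how the hardware does it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the two pieces a log(x) kernel wants. */
typedef struct {
    float exponent;  /* unbiased exponent as a float, like vgetexpps      */
    float mantissa;  /* significand scaled into [1, 2), like vgetmantps   */
} split_float;

/* Assumption: x is a positive normal binary32 value. */
static split_float split_exp_mant(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* type-pun without UB */

    int biased = (int)((bits >> 23) & 0xFF); /* 8-bit exponent field */

    /* Keep the 23 fraction bits and force the exponent field to 127,
       i.e. 2^0, so the result lies in [1, 2). */
    uint32_t mant_bits = (bits & 0x007FFFFFu) | 0x3F800000u;

    split_float s;
    s.exponent = (float)(biased - 127);
    memcpy(&s.mantissa, &mant_bits, sizeof s.mantissa);
    return s;
}
```

With x = mantissa * 2^exponent, log2(x) = exponent + log2(mantissa), so only log2 over [1, 2) needs a polynomial. Without AVX-512 this split costs several instructions per vector (shift, subtract, convert, AND, OR), which is the work vgetexpps and vgetmantps each fold into a single instruction.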

If you only want the mantissa as an integer, then sure AND away the other bits and you're done in one instruction.

(But remember that for a NaN, the mantissa is the NaN payload, so if you need to do something different for NaN then you need to check the exponent.)
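As a rough scalar illustration of the AND approach and the NaN caveat, the sketch below masks out the 23-bit mantissa field of a binary32 float and separately checks whether the exponent field is all ones. The function names are hypothetical; a SIMD version would do the same thing per lane with an AND against a broadcast 0x007FFFFF constant.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bit pattern of a binary32 float. */
static uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* The mantissa (fraction) field as an integer: one AND. */
static uint32_t mantissa_field(float x)
{
    return float_bits(x) & 0x007FFFFFu;
}

/* Exponent field all ones means infinity or NaN. */
static bool exponent_all_ones(float x)
{
    return ((float_bits(x) >> 23) & 0xFFu) == 0xFFu;
}

/* NaN: exponent all ones and a non-zero mantissa field (the payload). */
static bool is_nan_bits(float x)
{
    return exponent_all_ones(x) && mantissa_field(x) != 0;
}
```

For example, mantissa_field(1.5f) is 0x400000, and so is mantissa_field(-1.5f), since the mask also drops the sign bit. The implicit leading 1 of a normal number is not stored in the field, so it is not part of the result.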

How much is the latency for transferring the number from float to int and back to float?

You already have Agner Fog's instruction tables (https://agner.org/optimize/). On Skylake (SKL and SKX), VCVT(T)PS2DQ has 4c latency and runs on the FMA ports, and the same goes for the conversion in the other direction (VCVTDQ2PS).

Or are you asking about bypass latency for using the output of an FP instruction like andps as the input to an integer instruction?

Agner Fog's microarch PDF has some info about bypass latency for sending data between vec-int and fp domains, but not many specifics.

Skylake's bypass latency is weird: unlike on previous uarches, it depends on which port the instruction actually picked. andps has no bypass latency between FP instructions if it runs on port 5, but if it runs on p0 or p1, it has an extra 1c of latency.

See Intel's optimization manual for a table of domain-crossing latencies broken down by domain+execution-port.

(And just to be extra weird, this bypass-delay latency affects that register forever, even after the value has definitely been written back to a physical register and isn't being forwarded over the bypass network. vpaddd xmm0, xmm1, xmm2 has 2c latency for both inputs if either input came from vmulps. But some shuffles and other instructions work in either domain. It has been a while since I experimented with this, and I didn't check my notes, so this example might not be exactly right, but it's something like this.)

(Intel's optimization manual doesn't mention this permanent effect which lasts until you overwrite the architectural register with a new value. So be careful about creating FP constants ahead of a loop with integer instructions.)