assembly floating-point x86-64 sse rounding-error

Classification of x86 instructions according to floating point rounding mode sensitivity?

I'm implementing randomized rounding mode for evaluating numerical stability of floating point (FP) computations in C99 code. This is the idea of this computational chemistry work, but updated to handle the SSE and AVX instructions common on recent x86_64 architectures. Randomized rounding injects assembly instructions to set the rounding mode to either round to +inf or round to -inf in a way specific to each and every FP instruction, randomly. With multiple runs, you get samples of a distribution of possible results.

To do this correctly, I need to know which instructions involve FP rounding. There seems to be a 4-way classification:

Nothing at all to do with FP, e.g. movq, addq, imulq
Involving FP values, but no FP computation so nothing to do with rounding, e.g. movss, movaps, or, doing some computation on FP values but not in way that involves rounding, e.g. orps
Does FP computation that may involve rounding, and rounding mode is given explicitly in the instruction, e.g. roundss
Does FP computation that may involve rounding, and rounding mode is implicit (according to state previously set by ldmxcsr or fldcw), e.g. addss, sqrtss

I can catalog all the instructions I'm seeing in a given chunk of generated assembly, and read documentation (such as linked here) to learn which of category a given instruction is in, but this is slow and error-prone, and I don't know if different code, or different compiler settings, will target instructions I haven't yet manually classified.

What is the easiest way, given an x86 instruction, to classify it as above? Is there already a list somewhere with the equivalent info? Or, are there tips on terms to search for within the pages at https://www.felixcloutier.com/x86/ that can automatically determine this classification?

Solution

TL:DR: I think it should work to look for instructions that can raise a "precision" exception: that means they do rounding.
But you might not want to override rounding mode for instructions that round or convert to integer with the current rounding mode, that might be too big a rounding error.

@chtz suggests a useful heuristic of changing rounding before every ss/sd / ps/pd, instruction, except for a few common ones that you blacklist, like [v]movaps/d, [v]movups/d, [v]movss/d, [v]shuf*, [v]blend*, [v]perm*, bitwise booleans, etc. Also [v]ucomis/d / [v]comis/d and packed-SIMD compares like [v]cmpps; I think only denormals-are-zero would affect compares, not actual rounding modes.

You might want to make sure rounding mode is nearest-even for conversion to integer, like [v]cvtps2dq, or generally for all conversions to/from integer which all start with cvt mnemonics, although maybe still randomize for double-to-float ([v]cvtpd2ps). int->fp has to round for large uneven integers, like greater than 2^23 for float.

Using the AVX-512 EVEX encoding to override the rounding mode on a per-instruction basis without changing mxcsr could improve performance a lot if you have an AVX-512-capable CPU. See below.

Instruction databases:

https://www.felixcloutier.com/x86/ - scraped from Intel's PDFs, with scripts that are available.
https://github.com/HJLebbink/asm-dude/wiki/ - similar, and the github project is for an IDE plugin that uses a database of instructions locally to show info about them, or something. Its instruction list is also built from scraping Intel's PDFs.
https://uops.info/xml.html uses an XML file listing details of all instructions, taken from Intel's X86 Encoder Decoder (XED) library. uops.info has some Python scripts to do stuff with it, apparently.

If you want to use something like that, I think it should work to check for the possibility to raise an FP Precision exception, aka inexact. That's only possible if an instruction can sometimes produce a result that needs to be rounded. (Normally all FP exceptions are masked, so they only set a bit in MXCSR to record that they happened. We still talk about "raising" an FP exception regardless of whether they're masked, with the understanding that it's normally just setting a sticky bit in the FP environment, not actually trapping and running an exception handler and having the OS kill your process with SIGFPE.)

A few instructions can round but don't raise exceptions about it, like roundsd when the rounding mode is overridden with its immediate. But rounding all the way to an integer-valued double is normally part of an algorithm which you'd expect to break if you changed nearbyint to floor, ceil, or trunc. Usually roundsd is used with a specific rounding mode specified in the immediate, not using MXCSR at all, but it is possible to have it use the current default mode. (In which case it can raise a precision exception, so it would show up according to that heuristic.)

Probably also leave cvtsd2si and cvtps2dq alone, making sure they uses nearest-even as the program expected for (int)nearbyint or lrint. (Assuming your programs don't use fenv functions to set rounding modes.)

I don't think there are any instructions affected by the rounding mode which can never raise a precision exception.

Looking at whether any FP exceptions are possible is another way to tell whether it's an "FP math" instruction vs. just bitwise or shuffles or whatever. Instructions like ucomisd scalar double-precision compare can't raise many exceptions (the U for Unordered means that one or both operand being NaN isn't a cause for concern). But it doesn't suppress SNaN, only QNaN, and can raise a denormal exception.

I think specifically the inexact aka precision exception might be a useful litmus test. e.g.

movaps lists SIMD Floating-Point Exceptions: None because it doesn't do math at all.
addps lists SIMD Floating-Point Exceptions: Overflow, Underflow, Invalid, Precision, Denormal.
cvtpd2dq lists SIMD Floating-Point Exceptions ¶ Invalid, Precision. Packed conversion from FP to integer, using the current rounding mode. You probably want to whitelist this one as not getting munged, since +- 1 integer is a lot farther than +-1ulp.
ucomisd lists SIMD Floating-Point Exceptions ¶ Invalid (if SNaN operands), Denormal. (And no more; comparisons don't round or over/underflow. And the "u" stands for unordered, meaning it doesn't treat Quiet NaN as exception-worthy, only Signalling NaN. Other math instructions like sqrtsd produce QNaN on invalid inputs, you only get SNaN if you create one manually.)

"SIMD Floating Point" means it cares about MXCSR, not the x87 control / status registers. addss also uses those; it's a scalar instruction using SIMD registers.

Hopefully your program won't use any legacy x87 instructions at all, if you compile for x86-64 and don't use 80-bit long double. But if you do, the rounding mode and precision control in the x87 control word (fldcw) is totally separate from SSE/AVX rounding mode in MXCSR. (http://www.ray.masmcode.com/tutorial/index.html).

AVX-512 Embedded Rounding overrides

AVX-512 can override the rounding on a per-instruction basis, using a couple bits inside the EVEX prefix. This is much more efficient than changing MXCSR frequently, so I'd recommend it if you have access to a machine that supports AVX-512.

First saving instructions, at least a store + ldmxcsr, and ldmxcsr isn't cheap, like 4 uops on Intel SKX with 3 cycle throughput. (vs. 0.5c throughput for FP math). Or on Zen4, 6 uops with 21 cycle throughput. Plus, if a CPU doesn't rename the whole MXCSR including the rounding control, loading MXCSR would have to serialize execution of FP math instructions wrt. to it. I'm not sure if current Intel CPUs like Skylake or Alder Lake rename MXCSR; I found a couple SO answers mentioning the concept:

Observing x86 register dependencies quotes some Intel doc mentioning a perf event for some microarchitecture: MXCSR rename stall cycles - Stalls due to the MXCSR register rename occurring too close to a previous MXCSR rename. So that's bad for your use-case of changing it very often in FP-intensive code.
What resource_stall.other might mean - some mentions of MXCSR resource-stalls on Whiskey Lake, but IDK if that implies renaming or just waiting for older instructions to drain when the rounding mode changes.

Due to encoding limitations, Embedded-Rounding is only available for scalar and 512-bit vectors, not 128 or 256. And only when the operands are registers, not a memory source.

Overriding the rounding mode also implies SAE, suppress-all-exceptions.

# GAS AT&T syntax
vaddss {rn-sae}, %xmm0, %xmm1, %xmm1   # nearest-even
vaddss {rz-sae}, %xmm0, %xmm1, %xmm1   # toward zero
vaddss {rd-sae}, %xmm0, %xmm1, %xmm1   # down
vaddss {ru-sae}, %xmm0, %xmm1, %xmm1   # up

vaddps {ru-sae}, %zmm0, %zmm1, %zmm1   # also available for 512-bit vectors 
#vaddps {ru-sae}, %ymm0, %ymm1, %ymm1   # but not 128 or 256

#vaddss {rz-sae}, (%rsi), %xmm31, %xmm31   # nope, memory source operand
# vaddps {rz-sae}, (%rsi), %zmm31, %zmm31  # nope

Note the optional {er} in the AVX-512 EVEX form of addss (https://www.felixcloutier.com/x86/addss), but only for the 512-bit form of vaddps, not the EVEV xmm/ymm forms. See also AVX512 vector length and SAE control

Given those limitations on when it can be used, you might need some spare registers to use it. e.g. load into a temporary register like x/zmm31, then use a scalar of 512-bit instruction.

So maybe compile with AVX2+FMA, leaving you zmm16..31 available as temporaries. (Which you can dirty without worrying about vzeroupper)

 vsubps (%rsi), %xmm1, %xmm0       # 128-bit SIMD original

                                   # round-down replacement
 vmovups  (%rsi), %xmm31
 vsubps  {rd-sae}, %zmm31, %zmm1, %zmm31   # write a 512-bit temporary
 vmovaps   %xmm31, %xmm0                   # write the original destination with original width

  ----
 vsubps %xmm2, %xmm1, %xmm0       # 128-bit SIMD original

                                   # round-down replacement
 vsubps  {rd-sae}, %zmm2, %zmm1, %zmm31   # write a 512-bit temporary
 vmovaps   %xmm31, %xmm0                  # write the original destination with original width

I'm not 100% sure it's important to avoid writing %zmm0 on any CPUs if you're already using the high halves of ymm16..31, but I think it might be depending on exactly how SSE/AVX transitions and saved-uppers are implemented.

Dynamically determining where a rogue AVX-512 instruction is executing reports one write of ZMM0 was a problem in a program that never used vzeroupper.
Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?
Why is this SSE code 6 times slower without VZEROUPPER on Skylake? (just SSE/AVX transitions in Haswell/Ice Lake vs. Skylake).

If the original was using 256-bit YMM vectors, it will be using vzeroupper at some point before running legacy-SSE instructions. In that case:

 vsubps (%rsi), %ymm1, %ymm0       # original already using YMM upper halves

 vmovups  (%rsi), %ymm31           # replacement
 vsubps  {rd-sae}, %zmm31, %zmm1, %zmm0

For scalar it's easier and with less performance cost: we don't have to use any ZMM 512-bit vector operations, so we don't have any of the effects like SIMD instructions lowering CPU frequency or shutting down port 1 for SIMD uops on Intel CPUs. Or for Zen4, 512-bit uops occupy an execution unit for 2 cycles. (But are still single-uop for the front-end, which is excellent for front-end bottlenecks and out-of-order exec being able to see far.)

 vsubsd (%rsi), %xmm1, %xmm0       # scalar original

 vmovsd  (%rsi), %xmm31            # replacement still only needs scalar
 vsubsd  {rd-sae}, %xmm31, %xmm1, %xmm0

This does significantly increase code size, e.g. objdump Intel-syntax disassembly:

   0:   c5 f3 5c 06             vsubsd xmm0,xmm1,QWORD PTR [rsi]  # orig: 4 bytes, or 5 with some registers or instructions

   4:   62 61 ff 08 10 3e       vmovsd xmm31,QWORD PTR [rsi]      # replacement, 2x 6B
   a:   62 91 f7 38 5c c7       vsubsd xmm0,xmm1,xmm31{rd-sae}

Legacy x87

x87 is trickier because even fld / fstp (or a memory-source version of any other instruction) do conversion on the fly between the 80-bit internal format and 32/64-bit float/double in memory. That does involve rounding when storing, but not when loading, so you actually would want to randomize the x87 rounding mode before fst / fstp unless the destination is m80, storing the internal format unmodified.

I'm not sure what happens if the precision-control bits are set to 24-bit mantissa precision (32-bit float) when loading a 64-bit double (53-bit mantissa). The manual doesn't mention it, so probably the precision bits don't introduce new places to round, only affect how much precision is needed in the first place to speed up calculations, controlling rounding in cases where rounding would be required even with them set to full precision.

x87 fld lists Floating-Point Exceptions ¶
- #IS Stack underflow or overflow occurred. (can happen with any FPU instruction that pushes the register stack.)
- #IA (Source operand is an SNaN. Does not occur if the source operand is in double extended-precision floating-point format (FLD m80fp or FLD ST(i). So apparently fld doesn't treat QNaN as worthy of an exception.
- #D Source operand is a denormal value. Does not occur if the source operand is in double extended-precision floating-point format.
x87 fsincos lists like fld but #IA lists "Source operand is an SNaN value, ∞, or unsupported format.". Plus:
- #U (underflow) Result is too small for destination format.
- #P (precision) Value cannot be represented exactly in destination format.