I'm implementing randomized rounding mode for evaluating numerical stability of floating point (FP) computations in C99 code. This is the idea of this computational chemistry work, but updated to handle the SSE and AVX instructions common on recent x86_64 architectures. Randomized rounding injects assembly instructions to set the rounding mode to either round to +inf or round to -inf in a way specific to each and every FP instruction, randomly. With multiple runs, you get samples of a distribution of possible results.
To do this correctly, I need to know which instructions involve FP rounding. There seems to be a 4-way classification:
movq
, addq
, imulq
movaps
, or, doing some computation on FP values but not in way that involves rounding, e.g. orpsfldcw
), e.g. addss, sqrtss
I can catalog all the instructions I'm seeing in a given chunk of generated assembly, and read documentation (such as linked here) to learn which of category a given instruction is in, but this is slow and error-prone, and I don't know if different code, or different compiler settings, will target instructions I haven't yet manually classified.
What is the easiest way, given an x86 instruction, to classify it as above? Is there already a list somewhere with the equivalent info? Or, are there tips on terms to search for within the pages at https://www.felixcloutier.com/x86/ that can automatically determine this classification?
TL:DR: I think it should work to look for instructions that can raise a "precision" exception: that means they do rounding.
But you might not want to override rounding mode for instructions that round or convert to integer with the current rounding mode, that might be too big a rounding error.
@chtz suggests a useful heuristic of changing rounding before every ss
/sd
/ ps
/pd
, instruction, except for a few common ones that you blacklist, like [v]movaps/d
, [v]movups/d
, [v]movss/d
, [v]shuf*
, [v]blend*
, [v]perm*
, bitwise booleans, etc. Also [v]ucomis/d
/ [v]comis/d
and packed-SIMD compares like [v]cmpps
; I think only denormals-are-zero would affect compares, not actual rounding modes.
You might want to make sure rounding mode is nearest-even for conversion to integer, like [v]cvtps2dq
, or generally for all conversions to/from integer which all start with cvt
mnemonics, although maybe still randomize for double-to-float ([v]cvtpd2ps
). int->fp has to round for large uneven integers, like greater than 2^23 for float.
Using the AVX-512 EVEX encoding to override the rounding mode on a per-instruction basis without changing mxcsr
could improve performance a lot if you have an AVX-512-capable CPU. See below.
Instruction databases:
If you want to use something like that, I think it should work to check for the possibility to raise an FP Precision exception, aka inexact. That's only possible if an instruction can sometimes produce a result that needs to be rounded. (Normally all FP exceptions are masked, so they only set a bit in MXCSR to record that they happened. We still talk about "raising" an FP exception regardless of whether they're masked, with the understanding that it's normally just setting a sticky bit in the FP environment, not actually trapping and running an exception handler and having the OS kill your process with SIGFPE.)
A few instructions can round but don't raise exceptions about it, like roundsd
when the rounding mode is overridden with its immediate. But rounding all the way to an integer-valued double is normally part of an algorithm which you'd expect to break if you changed nearbyint
to floor
, ceil
, or trunc
. Usually roundsd
is used with a specific rounding mode specified in the immediate, not using MXCSR at all, but it is possible to have it use the current default mode. (In which case it can raise a precision exception, so it would show up according to that heuristic.)
Probably also leave cvtsd2si
and cvtps2dq
alone, making sure they uses nearest-even as the program expected for (int)nearbyint
or lrint
. (Assuming your programs don't use fenv
functions to set rounding modes.)
I don't think there are any instructions affected by the rounding mode which can never raise a precision exception.
Looking at whether any FP exceptions are possible is another way to tell whether it's an "FP math" instruction vs. just bitwise or shuffles or whatever. Instructions like ucomisd
scalar double-precision compare can't raise many exceptions (the U for Unordered means that one or both operand being NaN isn't a cause for concern). But it doesn't suppress SNaN, only QNaN, and can raise a denormal exception.
I think specifically the inexact aka precision exception might be a useful litmus test. e.g.
movaps
lists SIMD Floating-Point Exceptions: None because it doesn't do math at all.
addps
lists SIMD Floating-Point Exceptions:
Overflow, Underflow, Invalid, Precision, Denormal.
cvtpd2dq
lists SIMD Floating-Point Exceptions ¶
Invalid, Precision. Packed conversion from FP to integer, using the current rounding mode. You probably want to whitelist this one as not getting munged, since +- 1 integer is a lot farther than +-1ulp.
ucomisd
lists SIMD Floating-Point Exceptions ¶
Invalid (if SNaN operands), Denormal. (And no more; comparisons don't round or over/underflow. And the "u" stands for unordered, meaning it doesn't treat Quiet NaN as exception-worthy, only Signalling NaN. Other math instructions like sqrtsd
produce QNaN on invalid inputs, you only get SNaN if you create one manually.)
"SIMD Floating Point" means it cares about MXCSR, not the x87 control / status registers. addss
also uses those; it's a scalar instruction using SIMD registers.
Hopefully your program won't use any legacy x87 instructions at all, if you compile for x86-64 and don't use 80-bit long double
. But if you do, the rounding mode and precision control in the x87 control word (fldcw
) is totally separate from SSE/AVX rounding mode in MXCSR. (https://masm32.com/masmcode/rayfil/tutorial/).
AVX-512 can override the rounding on a per-instruction basis, using a couple bits inside the EVEX prefix. This is much more efficient than changing MXCSR frequently, so I'd recommend it if you have access to a machine that supports AVX-512.
First saving instructions, at least a store + ldmxcsr
, and ldmxcsr
isn't cheap, like 4 uops on Intel SKX with 3 cycle throughput. (vs. 0.5c throughput for FP math). Or on Zen4, 6 uops with 21 cycle throughput. Plus, if a CPU doesn't rename the whole MXCSR including the rounding control, loading MXCSR would have to serialize execution of FP math instructions wrt. to it. I'm not sure if current Intel CPUs like Skylake or Alder Lake rename MXCSR; I found a couple SO answers mentioning the concept:
Observing x86 register dependencies quotes some Intel doc mentioning a perf event for some microarchitecture: MXCSR rename stall cycles - Stalls due to the MXCSR register rename occurring too close to a previous MXCSR rename. So that's bad for your use-case of changing it very often in FP-intensive code.
What resource_stall.other might mean - some mentions of MXCSR resource-stalls on Whiskey Lake, but IDK if that implies renaming or just waiting for older instructions to drain when the rounding mode changes.
Update: Cores that support SMT (hyperthreading) probably do need to rename the whole MXCSR, so instructions from separate logical cores can be in flight using different rounding modes. There might not be a huge number of entries to rename it onto. So that includes AMD Zen and Intel P-cores, but not Intel E-cores; that would be worth testing.
Due to encoding limitations, Embedded-Rounding is only available for scalar and 512-bit vectors, not 128 or 256. And only when the operands are registers, not a memory source. (Update: AVX10.2 will introduce ER support for 256-bit vectors. Still not 128-bit, but promoting 128-bit SIMD ops to 256-bit is often fine unless you were depending on the zero-extension in the high half, and has much lower performance cost than 512-bit.)
Overriding the rounding mode also implies SAE, suppress-all-exceptions.
# GAS AT&T syntax
vaddss {rn-sae}, %xmm0, %xmm1, %xmm1 # nearest-even
vaddss {rz-sae}, %xmm0, %xmm1, %xmm1 # toward zero
vaddss {rd-sae}, %xmm0, %xmm1, %xmm1 # down
vaddss {ru-sae}, %xmm0, %xmm1, %xmm1 # up
vaddps {ru-sae}, %zmm0, %zmm1, %zmm1 # also available for 512-bit vectors
#vaddps {ru-sae}, %ymm0, %ymm1, %ymm1 # but not 128 or 256
#vaddss {rz-sae}, (%rsi), %xmm31, %xmm31 # nope, memory source operand
# vaddps {rz-sae}, (%rsi), %zmm31, %zmm31 # nope
Note the optional {er}
in the AVX-512 EVEX form of addss
(https://www.felixcloutier.com/x86/addss), but only for the 512-bit form of vaddps
, not the EVEV xmm/ymm forms. See also AVX512 vector length and SAE control
Given those limitations on when it can be used, you might need some spare registers to use it. e.g. load into a temporary register like x/zmm31, then use a scalar of 512-bit instruction.
So maybe compile with AVX2+FMA, leaving you zmm16..31 available as temporaries. (Which you can dirty without worrying about vzeroupper
)
vsubps (%rsi), %xmm1, %xmm0 # 128-bit SIMD original
# round-down replacement
vmovups (%rsi), %xmm31
vsubps {rd-sae}, %zmm31, %zmm1, %zmm31 # write a 512-bit temporary
vmovaps %xmm31, %xmm0 # write the original destination with original width
----
vsubps %xmm2, %xmm1, %xmm0 # 128-bit SIMD original
# round-down replacement
vsubps {rd-sae}, %zmm2, %zmm1, %zmm31 # write a 512-bit temporary
vmovaps %xmm31, %xmm0 # write the original destination with original width
I'm not 100% sure it's important to avoid writing %zmm0
on any CPUs if you're already using the high halves of ymm16..31, but I think it might be depending on exactly how SSE/AVX transitions and saved-uppers are implemented.
vzeroupper
.If the original was using 256-bit YMM vectors, it will be using vzeroupper
at some point before running legacy-SSE instructions. In that case:
vsubps (%rsi), %ymm1, %ymm0 # original already using YMM upper halves
vmovups (%rsi), %ymm31 # replacement
vsubps {rd-sae}, %zmm31, %zmm1, %zmm0
For scalar it's easier and with less performance cost: we don't have to use any ZMM 512-bit vector operations, so we don't have any of the effects like SIMD instructions lowering CPU frequency or shutting down port 1 for SIMD uops on Intel CPUs. Or for Zen4, 512-bit uops occupy an execution unit for 2 cycles. (But are still single-uop for the front-end, which is excellent for front-end bottlenecks and out-of-order exec being able to see far.)
vsubsd (%rsi), %xmm1, %xmm0 # scalar original
vmovsd (%rsi), %xmm31 # replacement still only needs scalar
vsubsd {rd-sae}, %xmm31, %xmm1, %xmm0
This does significantly increase code size, e.g. objdump Intel-syntax disassembly:
0: c5 f3 5c 06 vsubsd xmm0,xmm1,QWORD PTR [rsi] # orig: 4 bytes, or 5 with some registers or instructions
4: 62 61 ff 08 10 3e vmovsd xmm31,QWORD PTR [rsi] # replacement, 2x 6B
a: 62 91 f7 38 5c c7 vsubsd xmm0,xmm1,xmm31{rd-sae}
x87 is trickier because even fld
/ fstp
(or a memory-source version of any other instruction) do conversion on the fly between the 80-bit internal format and 32/64-bit float/double in memory. That does involve rounding when storing, but not when loading, so you actually would want to randomize the x87 rounding mode before fst
/ fstp
unless the destination is m80, storing the internal format unmodified.
I'm not sure what happens if the precision-control bits are set to 24-bit mantissa precision (32-bit float) when loading a 64-bit double (53-bit mantissa). The manual doesn't mention it, so probably the precision bits don't introduce new places to round, only affect how much precision is needed in the first place to speed up calculations, controlling rounding in cases where rounding would be required even with them set to full precision.
fld
lists Floating-Point Exceptions ¶
#IS
Stack underflow or overflow occurred.
(can happen with any FPU instruction that pushes the register stack.)#IA
(Source operand is an SNaN. Does not occur if the source operand is in double extended-precision floating-point format (FLD m80fp or FLD ST(i). So apparently fld
doesn't treat QNaN as worthy of an exception.#D
Source operand is a denormal value. Does not occur if the source operand is in double extended-precision floating-point format.fsincos
lists like fld
but #IA
lists "Source operand is an SNaN value, ∞, or unsupported format.". Plus:
#U
(underflow) Result is too small for destination format.#P
(precision) Value cannot be represented exactly in destination format.