While reading this Zenbleed article, I saw that a randomly generated sequence of instructions, and the same sequence with randomized alignment, serialization and speculation fences added, produced final states that didn't match.
For example:
Original Code            Fuzzed Code
--------------------     --------------------
movnti [rbp+0x0],ebx     movnti [rbp+0x0],ebx
                         sfence
rcr dh,1                 rcr dh,1
                         lfence
sub r10, rax             sub r10, rax
                         mfence
rol rbx, cl              rol rbx, cl
                         nop
xor edi,[rbp-0x57]       xor edi,[rbp-0x57]
The article says that a mismatch could indicate a bug:
If the final states don’t match, then there must have been some error in how they were executed micro-architecturally - that could indicate a bug.
Notes
As developers we monitor the macro-architectural state, that’s just things like register values. There is also the micro-architectural state which is mostly invisible to us, like the branch predictor, out-of-order execution state and the instruction pipeline.
Question
Are there any situations when it's not a bug when executed micro-architecturally?
The phrasing "when executed micro-architecturally" doesn't make sense. Every instruction sequence has to get executed by the microarchitecture (the CPU hardware design). The CPU hardware doesn't have any other way to run machine code.
Any observable (architectural, not timing) results always need to match what would happen if the instructions executed one at a time, in program order (except for memory contents observed from other threads). That is, out-of-order exec has to preserve the illusion of a serial execution model.
Since lfence, mfence, nop, etc. have no effect on the architectural state (register / memory contents), they shouldn't change anything for single-threaded code. If they do create a difference, that's always a problem. I think that's what you meant to ask, and it's what the quotes are saying.
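The comparison described here can be sketched as a toy model. This is not the article's fuzzer and it does not execute real x86 instructions: the `run` function, the two short programs, and the register values are all hypothetical, chosen only to show the idea of running both versions in program order and comparing the final architectural state.

```python
MASK64 = (1 << 64) - 1

# Instructions that, in this toy model, have no architectural effect.
ARCH_NOPS = {"nop", "lfence", "mfence", "sfence"}

def run(program, regs):
    """Execute a toy program one instruction at a time, in program order."""
    state = dict(regs)  # copy so the caller's initial state is untouched
    for op, *args in program:
        if op in ARCH_NOPS:
            continue  # fences and nop leave registers unchanged
        dst, src = args
        if op == "sub":
            state[dst] = (state[dst] - state[src]) & MASK64
        elif op == "xor":
            state[dst] = (state[dst] ^ state[src]) & MASK64
        else:
            raise ValueError(f"unsupported op: {op}")
    return state

# Hypothetical programs: the second only adds a fence and a nop.
original = [("sub", "r10", "rax"), ("xor", "rdi", "rbx")]
fuzzed = [("sub", "r10", "rax"), ("mfence",), ("xor", "rdi", "rbx"), ("nop",)]

initial = {"rax": 5, "rbx": 0xFF, "r10": 3, "rdi": 0x0F}
states_match = run(original, initial) == run(fuzzed, initial)
print("final states match:", states_match)
```

In this model the states match by construction. On real hardware the same comparison is the interesting part: if the two runs disagree, the CPU failed to preserve the serial execution model.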
There are instructions like rdtsc and rdpmc that read a timestamp or performance counter; those will of course give different results when you put slow instructions (which affect the TSC) or extra instructions/uops (which affect performance counters) into a sequence. rdpmc is essentially reading microarchitectural counters into architectural state (register values), and rdtsc is reading time, so there was never any expectation that they'd give the same results with or without serialization.
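A toy model can show why these instructions are the exception. Everything here is an assumption made for illustration: the `COST` table holds made-up cycle counts, and the model's `rdtsc` writes a simple counter into one named register, whereas the real instruction returns the timestamp in edx:eax.

```python
# Hypothetical per-instruction costs; real latencies vary by CPU.
COST = {"add": 1, "nop": 1, "lfence": 20, "rdtsc": 1}

def run(program, regs):
    """Run a toy program while tracking a cycle counter alongside registers."""
    state = dict(regs)
    cycles = 0  # stands in for timing, normally invisible to the program
    for op, *args in program:
        if op == "add":
            dst, src = args
            state[dst] = state[dst] + state[src]
        elif op == "rdtsc":
            # Copies timing into a register, making it architectural state.
            state[args[0]] = cycles
        # nop and lfence change no registers; they only cost cycles.
        cycles += COST[op]  # raises KeyError for ops outside the model
    return state

plain = [("add", "rbx", "rcx"), ("rdtsc", "rax")]
fenced = [("add", "rbx", "rcx"), ("lfence",), ("rdtsc", "rax")]

initial = {"rax": 0, "rbx": 1, "rcx": 2}
a = run(plain, initial)
b = run(fenced, initial)
print("rbx:", a["rbx"], b["rbx"])  # same computed value in both runs
print("rax:", a["rax"], b["rax"])  # different counter readings
```

The computed value in rbx is the same in both runs, but the value read by the model's `rdtsc` differs once the fence is added, so a state comparison would have to exclude such instructions or ignore the registers they write.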