Tags: compilation, x86, simulation, benchmarking, interpreter

Are there deterministic architecture emulators available?


Does such a thing as a deterministic (as in same result every run) architecture emulator exist? I want to use it to benchmark compilers/interpreters.

I do not mean an emulator that simply runs your program on whatever simulated architecture, but something that would compute an efficiency/speed index based on an analysis of the generated code (for example, by assigning a deterministic time cost to each instruction).

I can compute benchmark statistics on a real machine, but a deterministic result would eliminate the particularities of my machine and allow me to see the effect of small optimizations.


Solution

  • Intel's IACA is a static analysis tool (see What is IACA and how do I use it?). But it only works for a single loop and doesn't model cache effects, only the pipeline. (And it assumes nearly-ideal OoO scheduling, I think, so it probably doesn't find ROB-size limits, only front-end vs. execution port vs. loop-carried dependency latency bottlenecks). Plus IACA has some bugs in its cost model (e.g. its un-lamination rules for micro-fusion of indexed addressing modes are wrong for Haswell).
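
    The kind of estimate a tool like IACA reports can be sketched roughly: take the maximum of the front-end issue cost, the busiest execution port, and the loop-carried dependency-chain latency. Here's a minimal C++ sketch of that idea; the uop counts, port assignments, and latencies in the table are invented for illustration and don't accurately model any real microarchitecture.

    ```c++
    // Toy static bottleneck estimate in the spirit of IACA, assuming perfect
    // out-of-order scheduling. All per-instruction numbers below are made up.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Insn {
        const char* text;
        double frontend_uops;      // fused-domain uops issued per iteration
        double port_uops[4];       // unfused uops sent to a toy set of 4 ports
        double dep_chain_latency;  // cycles added to the loop-carried chain
    };

    int main() {
        // Hypothetical loop body with invented costs.
        std::vector<Insn> loop = {
            {"vmulps ymm0, ymm0, [rsi]", 1, {1, 0, 0, 0}, 4},  // on the carried chain
            {"vaddps ymm1, ymm1, ymm0",  1, {0, 1, 0, 0}, 0},
            {"add    rsi, 32",           1, {0, 0, 1, 0}, 0},
            {"cmp/jne (macro-fused)",    1, {0, 0, 0, 1}, 0},
        };

        double frontend = 0, ports[4] = {0, 0, 0, 0}, latency = 0;
        for (const Insn& i : loop) {
            frontend += i.frontend_uops;
            for (int p = 0; p < 4; ++p) ports[p] += i.port_uops[p];
            latency += i.dep_chain_latency;
        }

        double frontend_cycles = frontend / 4.0;  // assume 4 fused-domain uops/clock
        double port_cycles = *std::max_element(ports, ports + 4);
        double bottleneck = std::max({frontend_cycles, port_cycles, latency});

        std::printf("front-end %.2f, busiest port %.2f, dep chain %.2f "
                    "-> estimate %.2f cycles/iteration\n",
                    frontend_cycles, port_cycles, latency, bottleneck);
    }
    ```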

    AFAIK, there are no cycle-accurate x86 simulators publicly available for any modern micro-architecture. We only have emulators that don't even try to match the speed of any real hardware; they just run as fast as possible, like Bochs and QEMU. I'm sure Intel and AMD have simulator software internally to validate CPU designs and model their performance, though.

    You could probably assign a cycle cost to every instruction in an interpreting emulator like BOCHS and get a deterministic number, and maybe model the cache, too (there are cache simulators). It would be the same every time you ran it, but it wouldn't correspond to the running time on any real hardware!
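
    To make that concrete, here's a toy C++ sketch of such a deterministic cost model: a fixed per-opcode cycle table plus a tiny direct-mapped data-cache simulator for loads and stores. Every number in it (per-opcode costs, hit/miss penalties, cache size) is invented; its only useful property is that the total comes out identical on every run.

    ```c++
    // Deterministic "cycle" accounting for an interpreting emulator:
    // fixed per-opcode costs plus a toy direct-mapped cache model.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    enum class Op { Add, Mul, Load, Store, Branch };

    struct Insn { Op op; uint64_t addr; };  // addr only used for Load/Store

    struct CacheSim {
        static constexpr int kLines = 512;   // 512 * 64 B = 32 KiB, direct-mapped
        static constexpr int kLineBits = 6;  // 64-byte lines
        uint64_t tags[kLines] = {};
        bool valid[kLines] = {};

        // Invented access cost: 4 cycles on hit, 100 on miss.
        int access(uint64_t addr) {
            uint64_t line = addr >> kLineBits;
            int idx = static_cast<int>(line % kLines);
            if (valid[idx] && tags[idx] == line) return 4;
            valid[idx] = true;
            tags[idx] = line;
            return 100;
        }
    };

    // Invented fixed cost per opcode; memory cost is added by the cache model.
    int base_cost(Op op) {
        switch (op) {
            case Op::Add:    return 1;
            case Op::Mul:    return 3;
            case Op::Load:   return 0;
            case Op::Store:  return 0;
            case Op::Branch: return 2;
        }
        return 1;
    }

    int main() {
        // A tiny stand-in for the instruction stream the emulator would interpret.
        std::vector<Insn> trace = {
            {Op::Load, 0x1000}, {Op::Add, 0}, {Op::Mul, 0},
            {Op::Store, 0x1000}, {Op::Load, 0x1040}, {Op::Branch, 0},
        };

        CacheSim dcache;
        uint64_t cycles = 0;
        for (const Insn& i : trace) {
            cycles += base_cost(i.op);
            if (i.op == Op::Load || i.op == Op::Store)
                cycles += dcache.access(i.addr);  // add the memory-access cost
        }
        std::printf("deterministic cost: %llu cycles\n", (unsigned long long)cycles);
    }
    ```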

    Being deterministic is nowhere near sufficient to be interesting for tuning software. Modern x86 CPUs have a lot of microarchitectural state for out-of-order execution. We can often predict very closely how they'll run a loop (http://agner.org/optimize/, and other performance links in the x86 tag wiki), but on a larger scale there are many things that are only known by the vendors, so we couldn't write a truly accurate simulator even if we had the time. Things like branch prediction are known in general terms, but the details have not been reverse-engineered in full. Yet branch prediction is a critical part of making a heavily pipelined CPU sustain anywhere near 3 to 4 fused-domain (front-end) uops per clock in real code.

    Things get even more complicated if you want to model a multi-core machine, and SMT / HT adds lots of complexity between threads sharing a core. It's barely even deterministic on real hardware, because small timing variations can lead to different threads getting farther out of sync.

    To be really useful, you'd want to be able to test your code on Sandybridge, Haswell, Skylake, Bulldozer, Ryzen, and maybe Silvermont. And maybe different variants of those with different amounts of cache, and server vs. desktop where L3 / memory latency differs. (Many-core servers have significantly worse uncore latency, and lower single-threaded bandwidth even though the aggregate bandwidth is higher.)

    So the whole idea of a deterministic simulator for "the x86 architecture" is weird. You could make one simply by giving each instruction a cost of 1 cycle, but that would be totally unrealistic.