I wish to simulate a fairly non-trivial program in the gem5 environment.
I have three files that I cross-compiled for the designated ISA:
I use the command
build/ARM/gem5.opt configs/example/se.py --cpu-type=TimingSimpleCPU -c test/test-progs/hello/src/my_binary
But is there a way, maybe an argument of the se.py script, that can make my simulation run faster?
The default se.py commands (without --cpu-type) are normally already the fastest available (and therefore give the lowest simulation accuracy).
gem5.fast build

A .fast build can run about 20% faster without losing simulation accuracy by disabling some debug related macros:
scons -j `nproc` build/ARM/gem5.fast
build/ARM/gem5.fast configs/example/se.py --cpu-type=TimingSimpleCPU \
-c test/test-progs/hello/src/my_binary
The speedup is achieved by:
disabling asserts and logging through macros. https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/SConscript#L1395 does:
if 'fast' in needed_envs:
    CPPDEFINES = ['NDEBUG', 'TRACING_ON=0'],
NDEBUG is a standardized way to disable assert: _DEBUG vs NDEBUG
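As a quick standalone illustration (not gem5 code), an assert compiled with -DNDEBUG disappears entirely:

// ndebug_example.cc: illustration only, not part of the gem5 tree.
// g++ ndebug_example.cc && ./a.out            -> aborts on the failed assert
// g++ -DNDEBUG ndebug_example.cc && ./a.out   -> the assert compiles to nothing and the program finishes
#include <cassert>
#include <cstdio>

int main() {
    int x = 1;
    assert(x == 2);  // only checked when NDEBUG is not defined
    std::printf("reached the end, x = %d\n", x);
    return 0;
}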
TRACING_ON has effects throughout the source, but the most notable one is at: https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/base/trace.hh#L173
#if TRACING_ON
#define DPRINTF(x, ...) do { \
    using namespace Debug; \
    if (DTRACE(x)) { \
        Trace::getDebugLogger()->dprintf_flag( \
            curTick(), name(), #x, __VA_ARGS__); \
    } \
} while (0)
#else // !TRACING_ON
#define DPRINTF(x, ...) do {} while (0)
#endif // TRACING_ON
which implies that --debug-flags basically won't do anything in a .fast build.
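To see why no runtime option can bring the output back, here is a minimal sketch of the same compile-time on/off pattern (illustration only; the macro and variable names are made up, not gem5's):

// tracing_example.cc: illustration of the TRACING_ON pattern, not gem5 code.
// "opt-like" build:  g++ -DTRACING_ON=1 tracing_example.cc && ./a.out --trace   -> prints the message
// "fast-like" build: g++ -DTRACING_ON=0 tracing_example.cc && ./a.out --trace   -> prints nothing
#include <cstdio>
#include <cstring>

bool trace_enabled = false;  // stands in for gem5's runtime --debug-flags state

#if TRACING_ON
#define MY_DPRINTF(...) do { if (trace_enabled) std::printf(__VA_ARGS__); } while (0)
#else
#define MY_DPRINTF(...) do {} while (0)
#endif

int main(int argc, char **argv) {
    if (argc > 1 && std::strcmp(argv[1], "--trace") == 0)
        trace_enabled = true;  // runtime request, honored only if the code was compiled in
    MY_DPRINTF("tick %d: doing work\n", 42);
    return 0;
}

With TRACING_ON=0 the call site contains no code at all, so flipping trace_enabled (or --debug-flags in gem5) at runtime has nothing left to enable.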
turning on link time optimization: Does the --force-lto gem5 scons build option speed up simulation significantly and how does it compare to a gem5.fast build? This might slow down the link step (and therefore how long a recompile takes after a one line change); see the build command sketch below.
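For reference, such a build would look something like this (assuming the --force-lto option from the linked question is still accepted by the scons scripts of your gem5 version):

scons -j `nproc` --force-lto build/ARM/gem5.opt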
So in general .fast is not worth it if you are developing the simulator; it only pays off once you are done with whatever patches you may have and just need to run hundreds of simulations as fast as possible with different parameters.
TODO it would be good to benchmark which of the above changes matters the most for runtime, and if the link time is actually significantly slowed down by LTO.
gem5 performance profiling analysis
I'm not aware of any proper performance profiling of gem5 ever having been done to assess which parts of the simulation are slow and whether there is any easy way to improve them. Someone has to do that at some point and post it at: https://gem5.atlassian.net/browse/GEM5
Options that reduce simulation accuracy
Simulation would also be faster, but with lower accuracy, if you drop --cpu-type=TimingSimpleCPU:

build/ARM/gem5.opt configs/example/se.py -c test/test-progs/hello/src/my_binary

which falls back to the even simpler AtomicSimpleCPU model (atomic rather than timing memory accesses).
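If you prefer to be explicit, passing the CPU type by hand should be equivalent (assuming the default CPU type of se.py is still AtomicSimpleCPU):

build/ARM/gem5.opt configs/example/se.py --cpu-type=AtomicSimpleCPU \
-c test/test-progs/hello/src/my_binary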
Other lower accuracy but faster options include:
Also, if someone were to implement binary translation in gem5, which is how QEMU goes fast, that would be an amazing option.
Related