I wish to simulate a fairly non-trivial program in the gem5 environment.
I have three files that I cross-compiled for the designated ISA:
I use the command
build/ARM/gem5.opt configs/example/se.py --cpu-type=TimingSimpleCPU -c test/test-progs/hello/src/my_binary
But is there a way, maybe an argument of the se.py script, that can make my simulation run faster?
The default se.py commands (without --cpu-type) are normally already the fastest available (and therefore give the lowest simulation accuracy).
gem5.fast build

A .fast build can run about 20% faster without losing simulation accuracy by disabling some debug related macros:
scons -j `nproc` build/ARM/gem5.fast
build/ARM/gem5.fast configs/example/se.py --cpu-type=TimingSimpleCPU \
-c test/test-progs/hello/src/my_binary
The speedup is achieved by:
disabling asserts and logging through macros. https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/SConscript#L1395 does:
if 'fast' in needed_envs:
    CPPDEFINES = ['NDEBUG', 'TRACING_ON=0'],
NDEBUG is a standardized way to disable assert: _DEBUG vs NDEBUG
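As a quick standalone illustration (not gem5 code), an assert compiled with -DNDEBUG disappears entirely:

// ndebug_example.cc: illustration only, not part of the gem5 tree.
// g++ ndebug_example.cc && ./a.out            -> aborts on the failed assert
// g++ -DNDEBUG ndebug_example.cc && ./a.out   -> the assert compiles to nothing and the program finishes
#include <cassert>
#include <cstdio>

int main() {
    int x = 1;
    assert(x == 2);  // only checked when NDEBUG is not defined
    std::printf("reached the end, x = %d\n", x);
    return 0;
}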
TRACING_ON has effects throughout the source, but the most notable one is at: https://github.com/gem5/gem5/blob/ae7dd927e2978cee89d6828b31ab991aa6de40e2/src/base/trace.hh#L173
#if TRACING_ON
#define DPRINTF(x, ...) do { \
    using namespace Debug; \
    if (DTRACE(x)) { \
        Trace::getDebugLogger()->dprintf_flag( \
            curTick(), name(), #x, __VA_ARGS__); \
    } \
} while (0)
#else // !TRACING_ON
#define DPRINTF(x, ...) do {} while (0)
#endif // TRACING_ON
which implies that --debug-flags basically won't do anything in a .fast build.
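To see why no runtime option can bring the output back, here is a minimal sketch of the same compile-time on/off pattern (illustration only; the macro and variable names are made up, not gem5's):

// tracing_example.cc: illustration of the TRACING_ON pattern, not gem5 code.
// "opt-like" build:  g++ -DTRACING_ON=1 tracing_example.cc && ./a.out --trace   -> prints the message
// "fast-like" build: g++ -DTRACING_ON=0 tracing_example.cc && ./a.out --trace   -> prints nothing
#include <cstdio>
#include <cstring>

bool trace_enabled = false;  // stands in for gem5's runtime --debug-flags state

#if TRACING_ON
#define MY_DPRINTF(...) do { if (trace_enabled) std::printf(__VA_ARGS__); } while (0)
#else
#define MY_DPRINTF(...) do {} while (0)
#endif

int main(int argc, char **argv) {
    if (argc > 1 && std::strcmp(argv[1], "--trace") == 0)
        trace_enabled = true;  // runtime request, honored only if the code was compiled in
    MY_DPRINTF("tick %d: doing work\n", 42);
    return 0;
}

With TRACING_ON=0 the call site contains no code at all, so flipping trace_enabled (or --debug-flags in gem5) at runtime has nothing left to enable.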
turning on link time optimization: Does the --force-lto gem5 scons build option speed up simulation significantly and how does it compare to a gem5.fast build? This might slow down the link step (and therefore how long a recompile takes after a one line change); see the build command sketch below.
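For reference, such a build would look something like this (assuming the --force-lto option from the linked question is still accepted by the scons scripts of your gem5 version):

scons -j `nproc` --force-lto build/ARM/gem5.opt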
So in general .fast is not worth it if you are developing the simulator; it only pays off once you are done with whatever patches you may have and just need to run hundreds of simulations as fast as possible with different parameters.
TODO it would be good to benchmark which of the above changes matters the most for runtime, and if the link time is actually significantly slowed down by LTO.
gem5 performance profiling analysis
I'm not aware of any proper performance profiling of gem5 ever having been done to assess which parts of the simulation are slow and whether there is any easy way to improve them. Someone has to do that at some point and post it at: https://gem5.atlassian.net/browse/GEM5
Options that reduce simulation accuracy
Simulation would also be faster, but with lower accuracy, if you drop --cpu-type=TimingSimpleCPU:

build/ARM/gem5.opt configs/example/se.py -c test/test-progs/hello/src/my_binary

which falls back to the even simpler AtomicSimpleCPU model (atomic rather than timing memory accesses).
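If you prefer to be explicit, passing the CPU type by hand should be equivalent (assuming the default CPU type of se.py is still AtomicSimpleCPU):

build/ARM/gem5.opt configs/example/se.py --cpu-type=AtomicSimpleCPU \
-c test/test-progs/hello/src/my_binary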
Other lower accuracy but faster options include:
Also, if someone were to implement binary translation in gem5, which is how QEMU goes fast, that would be an amazing option.
Related