I am profiling C++ code that targets a RISC architecture. I built two binaries, one for x86 and the other for RISC-V, and profiled both using perf and gprof. The perf results seem to contradict what the theory of RISC and CISC architectures would predict. Could someone tell me what's wrong here?
Result of Perf:
Performance counter stats for './unit_tests' (CISC):
180,899022 task-clock (msec) # 0,885 CPUs utilized
7 context-switches # 0,039 K/sec
2 cpu-migrations # 0,011 K/sec
1.350 page-faults # 0,007 M/sec
588.853.057 cycles # 3,255 GHz
863.377.707 instructions # 1,47 insn per cycle
157.440.034 branches # 870,320 M/sec
992.067 branch-misses # 0,63% of all branches
0,204509183 seconds time elapsed
Performance counter stats for './unit_tests' (RISC):
693,264322 task-clock (msec) # 0,999 CPUs utilized
28 context-switches # 0,040 K/sec
1 cpu-migrations # 0,001 K/sec
2.400 page-faults # 0,003 M/sec
2.320.185.432 cycles # 3,347 GHz
5.467.630.410 instructions # 2,36 insn per cycle
960.171.812 branches # 1385,001 M/sec
7.038.808 branch-misses # 0,73% of all branches
0,693978844 seconds time elapsed
As the results above show, the elapsed time for the RISC binary is higher than for the CISC binary, and its instructions per cycle (insn per cycle) figure is also higher. Why is that? Am I missing something, or am I interpreting the results wrongly?
You're profiling QEMU itself as it interprets / emulates RISC-V, not the RISC-V "guest" code running inside QEMU. QEMU can't give you guest-level performance numbers; it's not a cycle-accurate simulator of anything.
Emulation is slower and takes more instructions than running native code compiled for your x86-64 in the first place.
Using binfmt_misc to transparently run qemu-riscv64 on RISC-V binaries makes ./unit_tests exactly equivalent to qemu-riscv64 ./unit_tests
Your test results prove this: perf stat qemu-riscv64 ./unit_tests gave you approximately the same results as what's in your question.
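One way to see why ./unit_tests ends up being handed to QEMU is to look at the e_machine field in the ELF header, which records the target architecture. The following is a minimal Python sketch and not something from the question: the helper names (read_header, elf_machine, needs_emulation) are made up for illustration, and the constants 62 (EM_X86_64) and 243 (EM_RISCV) are the values assigned in the ELF specification.

```python
import struct

# e_machine values from the ELF specification
EM_X86_64 = 62
EM_RISCV = 243


def read_header(path: str) -> bytes:
    """Read the first 20 bytes of a file: e_ident, e_type and e_machine."""
    with open(path, "rb") as f:
        return f.read(20)


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the start of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5) gives the byte order: 1 = little-endian, 2 = big-endian
    fmt = "<H" if header[5] == 1 else ">H"
    # e_machine is a 16-bit field at offset 18, right after e_type
    return struct.unpack_from(fmt, header, 18)[0]


def needs_emulation(header: bytes, host_machine: int = EM_X86_64) -> bool:
    """True if the binary targets a different machine than the (assumed) host."""
    return elf_machine(header) != host_machine
```

On an x86-64 host, needs_emulation(read_header("./unit_tests")) should return True for the RISC-V build. That is the case where binfmt_misc, if it is configured, substitutes qemu-riscv64, and perf ends up counting QEMU's own x86-64 instructions.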
Somewhat related: Modern Microprocessors A 90-Minute Guide! has some good details about how CPU pipelines work. A RISC CPU isn't always faster than a modern x86 CPU: modern x86 designs spend enough transistors to run x86-64 code fast.
You actually would expect more total instructions for the same work from a RISC CPU, just not that many more. Maybe 1.1x or 1.25x, not the roughly 6.3x in your perf results.
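As a quick sanity check, the ratios can be computed directly from the counters in the question. This is only arithmetic on the posted numbers; the variable names are mine.

```python
# Counters copied from the two perf stat outputs in the question
native_instructions = 863_377_707      # x86-64 binary run natively
emulated_instructions = 5_467_630_410  # RISC-V binary run through QEMU
native_cycles = 588_853_057
emulated_cycles = 2_320_185_432

instruction_ratio = emulated_instructions / native_instructions
cycle_ratio = emulated_cycles / native_cycles

print(f"instructions: {instruction_ratio:.2f}x")  # roughly 6.3x
print(f"cycles:       {cycle_ratio:.2f}x")        # roughly 3.9x
```

A factor of about 6.3 in instruction count is far outside the 1.1x to 1.25x range one might expect from ISA differences alone, which fits with the counters measuring the emulator rather than the RISC-V code.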
Performance depends on the microarchitecture, not (just) the instruction set. IPC and total time or cycles depend almost entirely on how aggressive the microarchitecture is at finding instruction-level parallelism. Modern Intel designs are some of the best at that, even in fairly dense CISC x86 code where memory-source instructions are common.
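The relationship between these quantities is time = instructions / (IPC × clock frequency). The sketch below plugs in the numbers from the question to show how a run can have a higher IPC and still take longer. The function names are hypothetical. Note also that perf derives the GHz figure from cycles and task-clock, so this is a consistency check rather than an independent measurement.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per clock cycle."""
    return instructions / cycles


def cpu_seconds(instructions: int, ipc_value: float, freq_hz: float) -> float:
    """CPU time from instruction count, IPC and clock frequency."""
    return instructions / (ipc_value * freq_hz)


# Values taken from the perf stat outputs in the question
native = {"instructions": 863_377_707, "cycles": 588_853_057, "freq_hz": 3.255e9}
emulated = {"instructions": 5_467_630_410, "cycles": 2_320_185_432, "freq_hz": 3.347e9}

for name, run in (("native x86-64", native), ("RISC-V under QEMU", emulated)):
    run_ipc = ipc(run["instructions"], run["cycles"])
    run_time = cpu_seconds(run["instructions"], run_ipc, run["freq_hz"])
    print(f"{name}: IPC {run_ipc:.2f}, about {run_time * 1000:.0f} ms of CPU time")
```

The emulated run has the higher IPC (about 2.36 versus 1.47), but it executes so many more instructions that it still needs roughly four times as many cycles. A higher IPC only means less time when the instruction count is comparable.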