Search code examples
x86qemuperfriscvinstruction-set

Different ISA binary profiling results contradiction


I am doing profiling of my code written in CPP targeting RISC architecture. I have two binaries generated one for x86 and other for RISC-V. I have done profiling using perf and gprof. As per Theory of RISC and CISC architecture,but what I have got from perf results is contradictory. Could someone tell me what's wrong here.

Result of Perf:

Performance counter stats for './unit_tests'CISC:

    180,899022      task-clock (msec)         #    0,885 CPUs utilized          
             7      context-switches          #    0,039 K/sec                  
             2      cpu-migrations            #    0,011 K/sec                  
         1.350      page-faults               #    0,007 M/sec                  
   588.853.057      cycles                    #    3,255 GHz                    
   863.377.707      instructions              #    1,47  insn per cycle        
   157.440.034      branches                  #  870,320 M/sec                  
       992.067      branch-misses             #    0,63% of all branches        

   0,204509183 seconds time elapsed

Performance counter stats for './unit_tests'RISC:

    693,264322      task-clock (msec)         #    0,999 CPUs utilized          
            28      context-switches          #    0,040 K/sec                  
             1      cpu-migrations            #    0,001 K/sec                  
         2.400      page-faults               #    0,003 M/sec                  
 2.320.185.432      cycles                    #    3,347 GHz                    
 5.467.630.410      instructions              #    2,36  insn per cycle        
   960.171.812      branches                  # 1385,001 M/sec                  
     7.038.808      branch-misses             #    0,73% of all branches        

   0,693978844 seconds time elapsed

As seen from the above results the time elapsed in RISC is more than CISC and also insn per cylce also more in RISC. I am wondering why is it so. Can someone tell me if I am missing something or interpreting the results wrong?


Solution

  • You're profiling qemu interpreting / emulating RISC-V, not the RISC-V "guest" code inside QEMU. QEMU can't do that; it's not a cycle-accurate simulator of anything.

    That's slower and takes more instructions than native code compiled for your x86-64 in the first place.

    Using binfmt_misc to transparently run qemu-riscv64 on RISC-V binaries makes ./unit_tests exactly equivalent to qemu-riscv64 ./unit_tests

    Your test results prove this: perf stat qemu-riscv64 ./unit_tests gave you approximately the same results as what's in your question.


    Somewhat related: Modern Microprocessors A 90-Minute Guide! has some good details about how CPU pipelines work. RISC isn't always better than modern x86 CPUs. They spend enough transistors to run x86-64 code fast.

    You actually would expect more total instructions for the same work from a RISC CPU, just not that many more instructions. Like maybe 1.1x or 1.25x?

    Performance depends on the microarchitecture, not (just) the instruction set. IPC and total time or cycles depends entirely on how aggressive the microarchitecture is at finding instruction-level parallelism. Modern Intel designs are some of the best at that, even in fairly dense CISC x86 code with memory-source instructions being common.