Why does cmp instruction cost too much time?

I am trying to do profile with libunwind (using linux perf), with perf top monitoring the target process, I get this assembly time cost screen:

  0.19 │       mov    %rcx,0x18(%rsp)                                                                                                                                    ▒
       │     trace_lookup():                                                                                                                                             ▒
  1.54 │       mov    0x8(%r9),%rcx                                                                                                                                      ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  0.52 │       and    $0x1,%edx                                                                                                                                          ◆
  0.57 │       mov    %r14d,0xc(%rsp)                                                                                                                                    ▒
  0.40 │       mov    0x78(%rsp),%r10                                                                                                                                    ▒
  1.24 │       sub    %rdx,%r15                                                                                                                                          ▒
       │     trace_lookup():                                                                                                                                             ▒
  0.35 │       shl    %cl,%r12d                                                                                                                                          ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  2.18 │       mov    0x90(%rsp),%r8                                                                                                                                     ▒
       │     trace_lookup():                                                                                                                                             ▒
  0.46 │       imul   %r15,%r13                                                                                                                                          ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  0.59 │       mov    %r15,0x88(%rsp)                                                                                                                                    ▒
       │     trace_lookup():                                                                                                                                             ▒
  0.50 │       lea    -0x1(%r12),%rdx                                                                                                                                    ▒
  1.22 │       shr    $0x2b,%r13                                                                                                                                         ▒
  0.37 │       and    %r13,%rdx                                                                                                                                          ▒
  0.57 │177:   mov    %rdx,%rbp                                                                                                                                          ▒
  0.43 │       shl    $0x4,%rbp                                                                                                                                          ▒
  1.33 │       add    %rdi,%rbp                                                                                                                                          ▒
  0.49 │       mov    0x0(%rbp),%rsi                                                                                                                                     ▒
 24.40 │       cmp    %rsi,%r15                                                                                                                                          ▒
       │     ↓ jne    420                                                                                                                                                ▒
       │     _ULx86_64_tdep_trace():                                                                                                                                     ▒
  2.10 │18e:   movzbl 0x8(%rbp),%edx                                                                                                                                     ▒
  3.68 │       test   $0x8,%dl                                                                                                                                           ▒
       │     ↓ jne    370                                                                                                                                                ▒
  1.27 │       mov    %edx,%eax                                                                                                                                          ▒
  0.06 │       shl    $0x5,%eax                                                                                                                                          ▒
  0.73 │       sar    $0x5,%al                                                                                                                                           ▒
  1.70 │       cmp    $0xfe,%al                                                                                                                                          ▒
       │     ↓ je     380                                                                                                                                                ▒
  0.01 │     ↓ jle    2f0                                                                                                                                                ▒
  0.01 │       cmp    $0xff,%al                                                                                                                                          ▒
       │     ↓ je     3a0                                                                                                                                                ▒
  0.02 │       cmp    $0x1,%al                                                                                                                                           ▒
       │     ↓ jne    298                                                                                                                                                ▒
  0.01 │       and    $0x10,%edx                                                                                                                                         ▒
       │       movl   $0x1,0x10(%rsp)                                                                                                                                    ▒
       │       movl   $0x1,0x1c8(%rbx)                                                                                                                                   ▒
  0.00 │     ↓ je     393

The corresponding source code is here trace_lookup source code, If I read correctly, the number of lines of code corresponding to this hot path cmp instruction is line 296, but I don't know why this line is so slow and cost most of the time?

Solution

Command cmp %rsi,%r15 is marked as having huge overhead because it waits for data to be loaded from cache or memory by mov 0x0(%rbp),%rsi command. There is probably L1 or even L2 cache miss on that command.

For the code fragment

       │     trace_lookup():
  0.50 │       lea    -0x1(%r12),%rdx
  1.22 │       shr    $0x2b,%r13     
  0.37 │       and    %r13,%rdx      
  0.57 │177:   mov    %rdx,%rbp      
  0.43 │       shl    $0x4,%rbp      
  1.33 │       add    %rdi,%rbp      
  0.49 │       mov    0x0(%rbp),%rsi 
 24.40 │       cmp    %rsi,%r15      
       │     ↓ jne    420

you have 24% of profiling events of the current function accounted to the cmp instruction. Default event for sampling profiling is "cycles" (hardware event for CPU clock cycles) or "cpu-clock" (software event for linear time). So, around 24% of sampling interrupts which did interrupt this function were reported for instruction address of this cmp command. There is systematic skew possible with profiling and modern Out-of-order CPUs, when cost is reported not for command which did run slowly, but for the command which did not finish its execution (retire) quickly. This cmp+jne command pair (fused uop) will change instruction flow of program if %rsi register value is not equal to %r15 register value. For ancient times such command should just read two registers and compare their values, which is fast and should not take 1/4 of function execution time. But with modern CPU registers are not just 32 or 64 bit place to store the value, they have some hidden flags (or renaming techniques) used in Out-of-order engines. In your example, there was mov 0x0(%rbp),%rsi which did change %rsi register. This command is load from memory by address *%rbp. CPU did start this load into cache/memory subsystem and mark %rsi register as "load pending from memory", continuing to execute instructions. There are some chances that next instructions will not require result of that load (which takes some time, for example Haswell: 4 cpu cycles for L1 hit, 12 for L2 hit, 36-66 for L3 hit, and additional 50-100 ns for cache miss and RAM read). But in your case next instruction was cmp+jne with read from %rsi, and that instruction can't be finished until data from memory is written to %rsi (CPU may block in the middle of cmp+jne execution or do many restarts of that command). So, cmp has 24% overhead because that mov did miss nearest caches. With more advanced counters you can estimate which cache it did miss, and which cache/memory layer did serve the request most often.

The corresponding source code is here trace_lookup source code, If I read correctly, the number of lines of code corresponding to this hot path cmp instruction is line 296, but I don't know why this line is so slow and cost most of the time?

With so short asm fragment it can be hard to find corresponding code line in source code of trace_lookup and to find what value and why was not in L1/L2 cache. You should try to write shorted reproducible example.