assembly openmp performance-testing icc microbenchmark

Faulty benchmark, puzzling assembly

Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.

Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.

Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6 (n^3 for matrix-matrix multiplication, performed on n slices of complex data-points i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.

Code snippet in main function:

// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);

Code snippet: do_timed_run:

void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);
    
    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }
    
    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;
    avg_dur += dur2 - dur1;
}

Notes:

At this point, I'm not providing the code for the function transp_matrix_tensor_mult because I don't think it is relevant.
the #pragma noinline is a debug fixture I'm using to be able to better understand the output of the disassembler.

And now for the disassembly of the function do_timed_run:

0000000000403a20 <_Z12do_timed_runRKmRd>:
  403a20:   48 81 ec d8 00 00 00    sub    $0xd8,%rsp
  403a27:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403a2e:   00 
  403a2f:   48 89 fd                mov    %rdi,%rbp
  403a32:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403a39:   00 
  403a3a:   48 89 f3                mov    %rsi,%rbx
  403a3d:   48 89 ee                mov    %rbp,%rsi
  403a40:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403a45:   48 89 ea                mov    %rbp,%rdx
  403a48:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403a4f:   00 
  403a50:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403a57:   00 
  403a58:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403a5f:   00 
  403a60:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403a67:   00 
  403a68:   e8 03 f8 ff ff          callq  403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
  403a6d:   48 89 ee                mov    %rbp,%rsi
  403a70:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403a75:   48 89 ea                mov    %rbp,%rdx
  403a78:   48 89 e9                mov    %rbp,%rcx
  403a7b:   e8 80 f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a80:   48 89 ee                mov    %rbp,%rsi
  403a83:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403a88:   48 89 ea                mov    %rbp,%rdx
  403a8b:   48 89 e9                mov    %rbp,%rcx
  403a8e:   e8 6d f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a93:   bf 88 f3 44 00          mov    $0x44f388,%edi
  403a98:   e8 53 f7 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403a9d:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403aa4:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403aa9:   33 c0                   xor    %eax,%eax
  403aab:   e8 20 f6 ff ff          callq  4030d0 <__kmpc_ok_to_fork@plt>
  403ab0:   85 c0                   test   %eax,%eax
  403ab2:   74 21                   je     403ad5 <_Z12do_timed_runRKmRd+0xb5>
  403ab4:   ba a5 3c 40 00          mov    $0x403ca5,%edx
  403ab9:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403abe:   be 02 00 00 00          mov    $0x2,%esi
  403ac3:   48 8d 4c 24 08          lea    0x8(%rsp),%rcx
  403ac8:   33 c0                   xor    %eax,%eax
  403aca:   4c 8d 41 38             lea    0x38(%rcx),%r8
  403ace:   e8 cd f5 ff ff          callq  4030a0 <__kmpc_fork_call@plt>
  403ad3:   eb 41                   jmp    403b16 <_Z12do_timed_runRKmRd+0xf6>
  403ad5:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403ada:   33 c0                   xor    %eax,%eax
  403adc:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403ae3:   e8 58 f7 ff ff          callq  403240 <__kmpc_serialized_parallel@plt>
  403ae8:   be 9c 13 47 00          mov    $0x47139c,%esi
  403aed:   48 8d bc 24 d0 00 00    lea    0xd0(%rsp),%rdi
  403af4:   00 
  403af5:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  403afa:   48 8d 4a 38             lea    0x38(%rdx),%rcx
  403afe:   e8 a2 01 00 00          callq  403ca5 <_Z12do_timed_runRKmRd+0x285>
  403b03:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403b08:   33 c0                   xor    %eax,%eax
  403b0a:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b11:   e8 aa f7 ff ff          callq  4032c0 <__kmpc_end_serialized_parallel@plt>
  403b16:   e8 85 f6 ff ff          callq  4031a0 <omp_get_wtime@plt>
  403b1b:   c5 fb 11 04 24          vmovsd %xmm0,(%rsp)
  403b20:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b25:   33 c0                   xor    %eax,%eax
  403b27:   e8 a4 f5 ff ff          callq  4030d0 <__kmpc_ok_to_fork@plt>
  403b2c:   85 c0                   test   %eax,%eax
  403b2e:   74 25                   je     403b55 <_Z12do_timed_runRKmRd+0x135>
  403b30:   ba 0b 3c 40 00          mov    $0x403c0b,%edx
  403b35:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b3a:   be 03 00 00 00          mov    $0x3,%esi
  403b3f:   48 8d 4c 24 08          lea    0x8(%rsp),%rcx
  403b44:   33 c0                   xor    %eax,%eax
  403b46:   4c 8d 41 38             lea    0x38(%rcx),%r8
  403b4a:   4c 8d 49 70             lea    0x70(%rcx),%r9
  403b4e:   e8 4d f5 ff ff          callq  4030a0 <__kmpc_fork_call@plt>
  403b53:   eb 45                   jmp    403b9a <_Z12do_timed_runRKmRd+0x17a>
  403b55:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b5a:   33 c0                   xor    %eax,%eax
  403b5c:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b63:   e8 d8 f6 ff ff          callq  403240 <__kmpc_serialized_parallel@plt>
  403b68:   be a0 13 47 00          mov    $0x4713a0,%esi
  403b6d:   48 8d bc 24 d0 00 00    lea    0xd0(%rsp),%rdi
  403b74:   00 
  403b75:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  403b7a:   48 8d 4a 38             lea    0x38(%rdx),%rcx
  403b7e:   4c 8d 42 70             lea    0x70(%rdx),%r8
  403b82:   e8 84 00 00 00          callq  403c0b <_Z12do_timed_runRKmRd+0x1eb>
  403b87:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b8c:   33 c0                   xor    %eax,%eax
  403b8e:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b95:   e8 26 f7 ff ff          callq  4032c0 <__kmpc_end_serialized_parallel@plt>
  403b9a:   e8 01 f6 ff ff          callq  4031a0 <omp_get_wtime@plt>
  403b9f:   c5 fb 5c 0c 24          vsubsd (%rsp),%xmm0,%xmm1
  403ba4:   c5 fb 10 05 cc c4 01    vmovsd 0x1c4cc(%rip),%xmm0        # 420078 <alpha_beta.61562.0.0.28+0x28>
  403bab:   00 
  403bac:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403bb1:   c4 e2 f9 a9 0b          vfmadd213sd (%rbx),%xmm0,%xmm1
  403bb6:   c5 fb 11 0b             vmovsd %xmm1,(%rbx)
  403bba:   e8 71 f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bbf:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403bc4:   e8 67 f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bc9:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403bce:   e8 5d f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bd3:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403bda:   00 
  403bdb:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403be2:   00 
  403be3:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403bea:   00 
  403beb:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403bf2:   00 
  403bf3:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403bfa:   00 
  403bfb:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403c02:   00 
  403c03:   48 81 c4 d8 00 00 00    add    $0xd8,%rsp
  403c0a:   c3                      retq   
  403c0b:   48 81 ec d8 00 00 00    sub    $0xd8,%rsp
  403c12:   4c 89 c6                mov    %r8,%rsi
  403c15:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403c1c:   00 
  403c1d:   4c 8d 24 24             lea    (%rsp),%r12
  403c21:   4c 89 e7                mov    %r12,%rdi
  403c24:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403c2b:   00 
  403c2c:   48 89 cd                mov    %rcx,%rbp
  403c2f:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403c36:   00 
  403c37:   48 89 d3                mov    %rdx,%rbx
  403c3a:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403c41:   00 
  403c42:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403c49:   00 
  403c4a:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403c51:   00 
  403c52:   e8 49 03 00 00          callq  403fa0 <_ZN5s3dft6matrixIdEC1ERKS1_> # <--- Here starts the part with the function call...
  403c57:   4c 89 e7                mov    %r12,%rdi
  403c5a:   48 89 de                mov    %rbx,%rsi
  403c5d:   48 89 ea                mov    %rbp,%rdx
  403c60:   e8 8b 01 00 00          callq  403df0 <_Z25transp_matrix_tensor_multIdEvRKN5s3dft6matrixIT_EERKNS0_6tensorIS2_EERS7_>
  403c65:   4c 89 e7                mov    %r12,%rdi
  403c68:   e8 63 01 00 00          callq  403dd0 <_ZN5s3dft6matrixIdED1Ev>     # <--- ...and here it ends
  403c6d:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403c74:   00 
  403c75:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403c7c:   00 
  403c7d:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403c84:   00 
  403c85:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403c8c:   00 
  403c8d:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403c94:   00 
  403c95:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403c9c:   00 
  403c9d:   48 81 c4 d8 00 00 00    add    $0xd8,%rsp
  403ca4:   c3                      retq   
  403ca5:   48 81 ec d8 00 00 00    sub    $0xd8,%rsp
  403cac:   48 89 d7                mov    %rdx,%rdi
  403caf:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403cb6:   00 
  403cb7:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403cbe:   00 
  403cbf:   48 89 cb                mov    %rcx,%rbx
  403cc2:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403cc9:   00 
  403cca:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403cd1:   00 
  403cd2:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403cd9:   00 
  403cda:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403ce1:   00 
  403ce2:   e8 99 f4 ff ff          callq  403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt> # <--- here are the calls to set-first-touch
  403ce7:   48 89 df                mov    %rbx,%rdi
  403cea:   e8 91 f4 ff ff          callq  403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt>
  403cef:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403cf6:   00 
  403cf7:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403cfe:   00 
  403cff:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403d06:   00 
  403d07:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403d0e:   00 
  403d0f:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403d16:   00 
  403d17:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403d1e:   00 
  403d1f:   48 81 c4 d8 00 00 00    add    $0xd8,%rsp
  403d26:   c3                      retq   
  403d27:   48 89 04 24             mov    %rax,(%rsp)
  403d2b:   bf 30 f4 44 00          mov    $0x44f430,%edi
  403d30:   e8 bb f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d35:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d3c:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403d41:   e8 9a 00 00 00          callq  403de0 <_ZN5s3dft6tensorIdED1Ev>
  403d46:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403d4b:   e8 90 00 00 00          callq  403de0 <_ZN5s3dft6tensorIdED1Ev>
  403d50:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403d55:   e8 76 00 00 00          callq  403dd0 <_ZN5s3dft6matrixIdED1Ev>
  403d5a:   48 8b 3c 24             mov    (%rsp),%rdi
  403d5e:   e8 5d f3 ff ff          callq  4030c0 <_Unwind_Resume@plt>
  403d63:   48 89 04 24             mov    %rax,(%rsp)
  403d67:   bf 68 f4 44 00          mov    $0x44f468,%edi
  403d6c:   e8 7f f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d71:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d78:   eb cc                   jmp    403d46 <_Z12do_timed_runRKmRd+0x326>
  403d7a:   48 89 04 24             mov    %rax,(%rsp)
  403d7e:   bf a0 f4 44 00          mov    $0x44f4a0,%edi
  403d83:   e8 68 f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d88:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d8f:   eb bf                   jmp    403d50 <_Z12do_timed_runRKmRd+0x330>
  403d91:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  403d98:   00 
  403d99:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

Primary questions:

Am I right in assuming that the function is being called outside the timed region?
If the above is true, why is this happening?
If the above isn't true, how can I find out why my benchmark is faulty?

Secondary questions:

Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?

Please consider that I have only provided the information which I think is relevant. Additional information will be gladly provided upon request. Thank you in advance for your time.

Edit:

@PeterCordes I did build with debug symbols enabled. The assembly posted above has been obtained using objdump, which somehow did not retrieve the required symbols. Here's (a snippet of) the assembly obtained using icpc:

#       omp_get_wtime()
        call      omp_get_wtime                                 #122.23
..___tag_value__Z12do_timed_runRKmRd.267:
..LN419:
                                # LOE rbx xmm0
..B4.12:                        # Preds ..B4.11
                                # Execution count [1.00e+00]
..LN420:
        vmovsd    %xmm0, (%rsp)                                 #122.23[spill]
..LN421:
                                # LOE rbx
..B4.13:                        # Preds ..B4.12
                                # Execution count [1.00e+00]
..LN422:
    .loc    1  123  is_stmt 1
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN423:
        xorl      %eax, %eax                                    #123.5
..___tag_value__Z12do_timed_runRKmRd.269:
..LN424:
        call      __kmpc_ok_to_fork                             #123.5
..___tag_value__Z12do_timed_runRKmRd.270:
..LN425:
                                # LOE rbx eax
..B4.14:                        # Preds ..B4.13
                                # Execution count [1.00e+00]
..LN426:
        testl     %eax, %eax                                    #123.5
..LN427:
        je        ..B4.17       # Prob 50%                      #123.5
..LN428:
                                # LOE rbx
..B4.15:                        # Preds ..B4.14
                                # Execution count [0.00e+00]
..LN429:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN430:
        xorl      %edx, %edx                                    #123.5
..LN431:
        incq      %rdx                                          #123.5
..LN432:
        xorl      %eax, %eax                                    #123.5
..LN433:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.271:
..LN434:
        call      __kmpc_push_num_threads                       #123.5
..___tag_value__Z12do_timed_runRKmRd.272:
..LN435:
                                # LOE rbx
..B4.16:                        # Preds ..B4.15
                                # Execution count [0.00e+00]
..LN436:
        movl      $L__Z12do_timed_runRKmRd_123__par_region1_2.5, %edx #123.5
..LN437:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN438:
        movl      $3, %esi                                      #123.5
..LN439:
        lea       8(%rsp), %rcx                                 #123.5
..LN440:
        xorl      %eax, %eax                                    #123.5
..LN441:
        lea       56(%rcx), %r8                                 #123.5
..LN442:
        lea       112(%rcx), %r9                                #123.5
..___tag_value__Z12do_timed_runRKmRd.273:
..LN443:
        call      __kmpc_fork_call                              #123.5
..___tag_value__Z12do_timed_runRKmRd.274:
..LN444:
        jmp       ..B4.20       # Prob 100%                     #123.5
..LN445:
                                # LOE rbx
..B4.17:                        # Preds ..B4.14
                                # Execution count [0.00e+00]
..LN446:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN447:
        xorl      %eax, %eax                                    #123.5
..LN448:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.275:
..LN449:
        call      __kmpc_serialized_parallel                    #123.5
..___tag_value__Z12do_timed_runRKmRd.276:
..LN450:
                                # LOE rbx
..B4.18:                        # Preds ..B4.17
                                # Execution count [0.00e+00]
..LN451:
        movl      $___kmpv_zero_Z12do_timed_runRKmRd_1, %esi    #123.5
..LN452:
        lea       208(%rsp), %rdi                               #123.5
..LN453:
        lea       8(%rsp), %rdx                                 #123.5
..LN454:
        lea       56(%rdx), %rcx                                #123.5
..LN455:
        lea       112(%rdx), %r8                                #123.5
..___tag_value__Z12do_timed_runRKmRd.277:
..LN456:
        call      L__Z12do_timed_runRKmRd_123__par_region1_2.5  #123.5
..___tag_value__Z12do_timed_runRKmRd.278:
..LN457:
                                # LOE rbx
..B4.19:                        # Preds ..B4.18
                                # Execution count [0.00e+00]
..LN458:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN459:
        xorl      %eax, %eax                                    #123.5
..LN460:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.279:
..LN461:
        call      __kmpc_end_serialized_parallel                #123.5
..___tag_value__Z12do_timed_runRKmRd.280:
..LN462:
                                # LOE rbx
..B4.20:                        # Preds ..B4.16 ..B4.19
                                # Execution count [1.00e+00]
..___tag_value__Z12do_timed_runRKmRd.281:
..LN463:
    .loc    1  128  is_stmt 1
#       omp_get_wtime()
        call      omp_get_wtime                                 #128.23

As you can see, the output is very verbose and harder to read.

Solution

1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so theoretical max = 16 double-precision FLOPs per core clock, so 24 * 16 = 384 GFLOP/S. (Using vectors of 4 doubles, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

There is a a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).

There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.

As far as benchmark validity, a good litmus test is whether it scales reasonably with problem size. Unless you exceed L3 cache size or not with a smaller or larger problem, the time should change in some reasonable way. If not, then you'd worry about it optimizing away, or clock speed warm-up effects (Idiomatic way of performance evaluation? for that and more, like page-faults.)

Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?

Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, then it jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.

Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?

Duplicate ret comes from "tail duplication" optimization, where multiple paths of execution that all return can just get their own ret instead of jumping to a ret. (And copies of any cleanup necessary, like restoring regs and stack pointer.)