Tags: c, numpy, floating-point, compiler-optimization, avx

What makes numpy.sum faster than an optimized (auto-vectorized) C loop?


I'm trying to write a C program that is as fast as numpy.sum on an array of doubles, but seem to fail.

Here is how I'm measuring numpy performance:

import numpy as np
import time

SIZE=4000000
REPS=5

xs = np.random.rand(SIZE)
print(xs.dtype)

for _ in range(REPS):
    start = time.perf_counter()
    r = np.sum(xs)
    end = time.perf_counter()
    print(f"{SIZE / (end-start) / 10**6:.2f} MFLOPS ({r:.2f})")

The output is:

float64
2941.61 MFLOPS (2000279.78)
3083.56 MFLOPS (2000279.78)
3406.18 MFLOPS (2000279.78)
3712.33 MFLOPS (2000279.78)
3661.15 MFLOPS (2000279.78)

Now trying to do something similar in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 4000000
#define REPS 5

double *make_random_array(long array_size) {
  double *array = malloc(array_size * sizeof(double));
  if (array == NULL)
    return NULL;
  srand(0);
  for (size_t i = 0; i < array_size; ++i) {
    array[i] = (double)rand() / RAND_MAX;
  }
  return array;
}

double sum_array(const double *array, long size) {
  double sum = 0.0;
  for (size_t i = 0; i < size; ++i) {
    sum += array[i];
  }
  return sum;
}

int main() {
  double *xs = make_random_array(SIZE);
  if (xs == NULL) return 1;

  for (int i = 0; i < REPS; i++) {
    clock_t start_time = clock();
    double r = sum_array(xs, SIZE);
    clock_t end_time = clock();
    double dt = (double)(end_time - start_time) / CLOCKS_PER_SEC;
    printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
  }

  free(xs);
  return 0;
}

Compiling this with gcc -o main -Wall -O3 -mavx main.c and running it, the output is:

1850.14 MFLOPS (1999882.86)
1857.01 MFLOPS (1999882.86)
1900.24 MFLOPS (1999882.86)
1903.86 MFLOPS (1999882.86)
1906.58 MFLOPS (1999882.86)

Clearly this is slower than numpy.

According to top, CPU usage of the python process is around 100%, so it doesn't look like numpy is parallelizing anything.

The C code appears to use 256 bit AVX registers (when compiling with -S there are vaddsd instructions on xmm0). This seems to be the best option, as the machine I'm on doesn't appear to support AVX-512:

$ egrep 'model name|flags' /proc/cpuinfo  | head -n2
model name      : 13th Gen Intel(R) Core(TM) i9-13900K
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities

What sort of trick does numpy do to beat this piece of C code?


Solution

  • Your loop is not optimized at all because strict FP math is the default. XMM0 is a 128-bit register, YMM0 is the corresponding 256-bit. vaddsd is ADD Scalar Double, using the low element of XMM0. https://felixcloutier.com/x86/addsd

    Compile with clang -O3 -ffast-math -march=native to let it vectorize and unroll (by 4x); that gives a 16x speedup, 4x each from AVX and from instruction-level parallelism (wikipedia / Modern Microprocessors A 90-Minute Guide!), with an array small enough not to bottleneck on L3 cache bandwidth. (Another roughly 2x of performance is available with an array that fits in L1d, not just L2, e.g. with #pragma clang loop interleave_count(8) to unroll more, for code you've cache-blocked so it usually gets hits in L1d cache.)

    Your Raptor Lake CPU has two fully-pipelined vector-FP add units, with pipeline length 3 (cycles of latency before the result is ready to be an input to another add). This answer includes my results on an i7-6700k Skylake, which has the same layout except that its FP-add pipelines have 4-cycle latency.

    @Jérôme Richard commented that NumPy just does scalar pairwise summation for sums of FP arrays, which gains some ILP over a purely naive serial loop. That can be OK if you're bottlenecked on DRAM bandwidth. One upside is numerical consistency across ISAs and available SIMD features, achieved by not using them.
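    To make that concrete, here is a minimal sketch of the pairwise-summation idea (not NumPy's actual implementation; the base-case block size here is arbitrary): splitting recursively keeps the rounding error growing roughly as O(log n) instead of O(n), and the two independent halves expose some ILP.

    double pairwise_sum(const double *a, long n) {
      if (n <= 8) {               // small base case, summed naively
        double s = 0.0;
        for (long i = 0; i < n; ++i)
          s += a[i];
        return s;
      }
      long half = n / 2;          // the two halves are independent dependency chains
      return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
    }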


    You're looking for vaddpd ymm0, ymm0, [rdi] (Packed Double on a 256-bit vector). GCC will do this with -ffast-math, which among other things lets it pretend FP math ops are associative, changing the rounding error. (For the better in this case, e.g. if you compare against a long double sum or Kahan error-compensated summation; vectorizing with multiple partial sums is a step toward the same idea as pairwise summation.) See also https://gcc.gnu.org/wiki/FloatingPointMath

    gcc -O3 -march=native -ffast-math  foo.c
    

    This gives about a factor-of-4 speedup, since FP ALU latency (1 vector instead of 1 scalar per 3 cycles on your CPU) is still a worse bottleneck than L3 bandwidth, and definitely worse than L2 cache bandwidth.
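    If you want an accuracy reference for that comparison (the long double or Kahan sum mentioned above), here is a minimal compensated-summation sketch. Note that -ffast-math would let the compiler optimize the compensation away, so build the reference without it.

    double kahan_sum(const double *a, long n) {
      double sum = 0.0, c = 0.0;      // c accumulates the lost low-order bits
      for (long i = 0; i < n; ++i) {
        double y = a[i] - c;
        double t = sum + y;
        c = (t - sum) - y;            // recovers the part of y that was rounded off
        sum = t;
      }
      return sum;
    }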


    SIZE=4000000 times sizeof(double) is 30.52 MiB, so it will fit in your high-end Raptor Lake's 36MiB L3 cache. But to go much faster, you're going to need to reduce SIZE and crank up REPS (and perhaps put a repeat-loop inside a timed region.) The whole program is pretty short to profile with perf stat, under 100 milliseconds on my i7-6700k Skylake with DDR4-2666, and much of that is startup. It's also pretty short to be timing with clock() instead of clock_gettime.
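    For reference, a minimal sketch of a wall-clock timer with nanosecond granularity (the now_sec helper name is just for this example):

    #include <time.h>

    static double now_sec(void) {     // CLOCK_MONOTONIC: monotonic wall time, ns resolution
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
    }
    // usage:  double start = now_sec();  ...  double dt = now_sec() - start;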

    Your per-core cache sizes are 48 KiB L1d, 2 MiB L2 (on the Golden Cove P-cores, less/more on a single Gracemont E-core). https://en.wikipedia.org/wiki/Raptor_Lake / https://chipsandcheese.com/2022/08/23/a-preview-of-raptor-lakes-improved-l2-caches/ . SIZE=6144 would make the array the full size of L1d. If we aim for a 40 KiB footprint for just the array, leaving room for some other stuff, SIZE=5120. Might as well make it aligned by 32 bytes with aligned_alloc so we can read it from L1d cache at 3 vectors (96 bytes) per clock cycle instead of having a cache-line split every other vector. (https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/ / https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-throughput-limit / https://uops.info/)

    To get anywhere near peak FLOPS (within a factor of 2 since we're not using FMA), we need to run 2 vaddpd instructions every clock cycle. But it has a latency of 3 cycles on your Golden Cove P-cores (Alder/Raptor Lake), so the latency * bandwidth product is 6 vaddpd in flight at once. That's the minimum number of dependency chains, preferably at least 8. Anything less will leave loop-carried dependency chains as the bottleneck, not throughput. (Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators))

    So you're looking for additional instructions in the inner loop like vaddpd ymm1, ymm1, [rdi+32]. Golden Cove's 3c latency / 0.5c reciprocal throughput for vaddps/pd is due to dedicated SIMD-FP add ALUs instead of the 4-cycle pipeline of the mul/FMA execution units, which had also handled add/sub since Skylake. Unlike Haswell, Golden Cove (Alder/Raptor Lake P-core) has two such ALUs, so the throughput is still as good as FMA. The sketch below shows the same idea at the C level.
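    A scalar C illustration of the multiple-accumulator idea (assuming size is a multiple of 4; with -ffast-math a compiler can apply the same transformation with vector accumulators):

    double sum_array_4acc(const double *array, long size) {
      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
      for (long i = 0; i < size; i += 4) {
        s0 += array[i];       // four independent dependency chains, so adds from
        s1 += array[i + 1];   // different chains can overlap in the pipelined
        s2 += array[i + 2];   // FP-add units instead of serializing on one sum
        s3 += array[i + 3];
      }
      return (s0 + s1) + (s2 + s3);
    }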

    GCC's -funroll-loops is useless here, unrolling the loop but still with only one accumulator vector. (Even with #pragma omp simd reduction (+:sum) and -fopenmp.) Clang will unroll with 4 accumulators by default. With -march=raptorlake it will unroll by 20, but still with only 4 accumulators, so it does 5 adds into each vector. And it uses indexed addressing modes like [rdi + 8*rcx + 32], so each vaddpd ymm, ymm, [reg+reg*8] un-laminates to 2 uops, not reducing front-end cost nearly as much as it could. There's only one array involved, so there wouldn't even be any extra cost to use a pointer increment instead of an index; it's not doing anything clever like indexing relative to the end of the array with a negative index that counts up toward zero. But this isn't a bottleneck; Golden Cove's wide front-end (6 uops) can issue 3 such vaddpd [mem+idx] instructions per cycle, so it stays ahead of the back-end (2/clock). Even 4-wide Skylake can keep up with this limited unroll.

    #pragma clang loop interleave_count(8) before the for() gets clang to unroll with more accumulators. (With more than 8 it ignores it and just does 4 :/) This is maybe only a good idea for code you expect to get L1d hits; if you expect your array needs to come from L2 or farther away, the default is good. Of course, the non-interleaving part of the unroll is just a waste of code-size in that case, too, and would cost more cleanup code if n weren't a compile-time constant. Docs: https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

    With the defaults, no pragmas but -O3 -ffast-math -march=native (on Skylake also using -mbranches-within-32B-boundaries), we get the same unroll by 20 with an interleave of 4 accumulators as Clang uses on Raptor Lake. (It also fully unrolls the REPS timing/printing loop, so this big loop is repeated 5 times. This is almost certainly worse than spending 1 register and a couple instructions to recycle code that's already hot in cache.)

    # clang 16  no pragma, unrolls by 20 with 4 accumulators
    inner_loop_top:
        1360:       c5 fd 58 84 cb a0 fd ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x260]
        1369:       c5 f5 58 8c cb c0 fd ff ff      vaddpd ymm1,ymm1,[rbx+rcx*8-0x240]
        1372:       c5 ed 58 94 cb e0 fd ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0x220]
        137b:       c5 e5 58 9c cb 00 fe ff ff      vaddpd ymm3,ymm3, [rbx+rcx*8-0x200]
        1384:       c5 fd 58 84 cb 20 fe ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x1e0]
      ... ymm1, ymm2
        139f:       c5 e5 58 9c cb 80 fe ff ff      vaddpd ymm3,ymm3,[rbx+rcx*8-0x180]
    
    ... 2 more copies of ymm0..3, ending with the next insn, the first to use a 1-byte disp8
        13e7:       c5 e5 58 5c cb 80       vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]
    
        13ed:       c5 fd 58 44 cb a0       vaddpd ymm0,ymm0, [rbx+rcx*8-0x60]
        13f3:       c5 f5 58 4c cb c0       vaddpd ymm1,ymm1, [rbx+rcx*8-0x40]
        13f9:       c5 ed 58 54 cb e0       vaddpd ymm2,ymm2, [rbx+rcx*8-0x20]
        13ff:       c5 e5 58 1c cb          vaddpd ymm3,ymm3, [rbx+rcx*8]
        1404:       48 83 c1 50             add    rcx,0x50
        1408:       48 81 f9 ec 0f 00 00    cmp    rcx,0xfec
        140f:       0f 85 4b ff ff ff       jne    1360 <main+0x80>
    

    vs. with the pragma: unrolled by 16, with 8 accumulators, when inlined into main. 4000 isn't quite a multiple of 16 x 4, so the loop exit condition is in between sets of 8 adds, in the middle of the loop.

    # clang 16  with pragma, unrolls by 16 with 8 accumulators
    inner_loop_top:
        13f0:       c5 fd 58 84 cb 20 fe ff ff      vaddpd ymm0,ymm0,[rbx+rcx*8-0x1e0]
        13f9:       c5 f5 58 8c cb 40 fe ff ff      vaddpd ymm1,ymm1,[rbx+rcx*8-0x1c0]
        1402:       c5 ed 58 94 cb 60 fe ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0x1a0]
        140b:       c5 e5 58 9c cb 80 fe ff ff      vaddpd ymm3,ymm3, [rbx+rcx*8-0x180]
        1414:       c5 dd 58 a4 cb a0 fe ff ff      vaddpd ymm4,ymm4,[rbx+rcx*8-0x160]
        141d:       c5 d5 58 ac cb c0 fe ff ff      vaddpd ymm5,ymm5, [rbx+rcx*8-0x140]
        1426:       c5 cd 58 b4 cb e0 fe ff ff      vaddpd ymm6,ymm6,[rbx+rcx*8-0x120]
        142f:       c5 c5 58 bc cb 00 ff ff ff      vaddpd ymm7,ymm7, [rbx+rcx*8-0x100]
    1438:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]       # JCC erratum workaround
        1440:       48 81 f9 bc 0f 00 00    cmp    rcx,0xfbc
        1447:       0f 84 33 ff ff ff       je     1380 <main+0x60>
        144d:       c5 fd 58 84 cb 20 ff ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0xe0]
        1456:       c5 f5 58 8c cb 40 ff ff ff      vaddpd ymm1,ymm1, [rbx+rcx*8-0xc0]
        145f:       c5 ed 58 94 cb 60 ff ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0xa0]
        1468:       c5 e5 58 5c cb 80       vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]
        146e:       c5 dd 58 64 cb a0       vaddpd ymm4,ymm4, [rbx+rcx*8-0x60]
        1474:       c5 d5 58 6c cb c0       vaddpd ymm5,ymm5, [rbx+rcx*8-0x40]
        147a:       c5 cd 58 74 cb e0       vaddpd ymm6,ymm6, [rbx+rcx*8-0x20]
        1480:       c5 c5 58 3c cb          vaddpd ymm7,ymm7, [rbx+rcx*8]
        1485:       48 83 c1 40             add    rcx,0x40
        1489:       e9 62 ff ff ff          jmp    13f0 <main+0xd0>
    

    I tried changing the source to encourage the compiler to increment a pointer, but clang doesn't take the hint; it invents an index counter in a register and uses addressing modes like [rdi + r8*8 + 0x20]:

      const double * endp = array+size;
    #pragma clang loop interleave_count(8)
      while (array != endp) {  // like a C++ range-for
        sum += *array++;       // no benefit, clang pessimizes back to an index
      }
    

    Updated microbenchmark source code

    // #define SIZE 5120 // 40 KiB, fits in Raptor Lake's 48KiB
    #define SIZE 4000     // fits in SKL's 32KiB L1d cache
    #define REPS 5
    
    ...
    
            double *array = aligned_alloc(32, array_size * sizeof(double));
    //  double *array = malloc(array_size * sizeof(double));
    
    ...
    
    double sum_array(const double *array, long size) {
      double sum = 0.0;
    //#pragma clang loop interleave_count(8)   // uncomment this, optionally
      for (size_t i = 0; i < size; ++i) {
        sum += array[i];
      }
      return sum;
    }
    
    
    int main() {
      double *xs = make_random_array(SIZE);
      if (xs == NULL) return 1;
    
      const int  inner_reps = 1000000;  // sum the array this many times each timed interval
      for (int i = 0; i < REPS; i++) {
        clock_t start_time = clock();
        volatile double r;  // do something with the sum even when we don't print
        for (int i=0 ; i<inner_reps ; i++){  // new inner loop
           r = sum_array(xs, SIZE);
           //  asm(""::"r"(xs) :"memory");  // forget about the array contents and redo the sum
           // turned out not to be necessary, clang is still doing the work
        }
        clock_t end_time = clock();
        double dt = (double)(end_time - start_time) / (CLOCKS_PER_SEC * inner_reps);
        printf("%.2f MFLOPS (%.2f)\n", (double)SIZE / dt / 1000000, r);
      }
    
      free(xs);
      return 0;
    }
    

    With a const int inner_reps = 1000000; repeat count of sums inside each timed interval added, and some measures to make sure the optimizer doesn't defeat it (Godbolt - also SIZE reduced to 4000 to fit in my 32KiB L1d), on my Skylake at 4.2 GHz, I get a 16x speedup as expected.

    GCC 13.2.1, clang 16.0.6 on Arch GNU/Linux, kernel 6.5

    # Without any vectorization
    $ gcc -O3 -march=native -Wall arr-sum.c
    $ taskset -c 1 perf stat  -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_double   -r1 ./a.out
    1057.69 MFLOPS (2003.09)
    1059.17 MFLOPS (2003.09)
    1059.67 MFLOPS (2003.09)
    1060.30 MFLOPS (2003.09)
    1060.34 MFLOPS (2003.09)
    ... perf results below
    
    # with 1 vector accumulator
    $ gcc -O3 -march=native -ffast-math -Wall arr-sum.c
    $ taskset -c 1 perf stat ... a.out
    4389.68 MFLOPS (2003.09)
    4389.32 MFLOPS (2003.09)
    4381.48 MFLOPS (2003.09)
    4393.57 MFLOPS (2003.09)
    4389.98 MFLOPS (2003.09)
    ... perf results below
    
    # unrolled by 4 vectors
    $ clang -O3 -march=native -ffast-math -Wall arr-sum.c   # clang unrolls by default
    $ taskset -c 1 perf stat ... a.out
    17048.41 MFLOPS (2003.09)
    17072.49 MFLOPS (2003.09)
    17060.55 MFLOPS (2003.09)
    17081.02 MFLOPS (2003.09)
    17099.79 MFLOPS (2003.09)
    ... perf results below, but including:
         2,303,995,395      idq.mite_uops                    #    1.965 G/sec                     
      # suffering from the JCC erratum in the inner loop; avoid it:
    
    $ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
    $ taskset -c 1 perf stat ... a.out
    17013.53 MFLOPS (2003.09)
    17061.79 MFLOPS (2003.09)
    17064.99 MFLOPS (2003.09)
    17109.44 MFLOPS (2003.09)
    17001.74 MFLOPS (2003.09)
    ... perf results below; summary: 1.17 seconds
         4,905,130,231      cycles                           #    4.178 GHz                       
         5,941,872,098      instructions                     #    1.21  insn per cycle
             5,165,165      idq.mite_uops                    #    4.399 M/sec
         5,015,000,000      fp_arith_inst_retired.256b_packed_double #    4.271 G/sec
    
     # With  #pragma clang loop interleave_count(8) in the source
     # for unrolling by 8 instead of 4
    $ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c
    $ taskset -c 1 perf stat ... a.out
    28505.05 MFLOPS (2003.09)
    28553.48 MFLOPS (2003.09)
    28556.13 MFLOPS (2003.09)
    28597.37 MFLOPS (2003.09)
    28548.18 MFLOPS (2003.09)
     # imperfect scheduling and a front-end bottleneck from clang's bad choice of addressing-mode
     # means we don't get another 2x over the default.
    

    (With perf stat -d, I also confirmed the L1d cache miss rate was under 1%. With a larger array size, like 20000 which fits in Skylake's 256K L2 cache but not L1d, I still got fairly close to 1 vector per clock throughput.)

    The JCC erratum workaround (Skylake-family only, not your CPU) gave negligible further speedup in this case; the front-end wasn't the bottleneck even with legacy decode, since un-lamination happens at issue so the decoders aren't choking on 2-uop instructions. And average uops_issued.any throughput was still only 2.18 / clock with the 4x unroll.

    So we get a factor of 16 speedup on Skylake from vectorizing with AVX (4x) and the instruction-level parallelism of 4 accumulators. This still averages only slightly better than 1 vaddpd per clock cycle (since there's ILP between repeat-loop iterations), but clang's 4 dep chains are only half of Skylake's 4-cycle latency x 2 insn/cycle throughput = 8 FP math instructions in flight.

    Unrolling by 4 leaves another factor of 2 of performance on the table (for Skylake, less for Alder Lake and later. Update: we got most of it with the pragma). But it's only attainable with data hot in L1d cache, with careful cache-blocking, or if you're doing more work with data while it's in registers (higher computational intensity, not just 1 add per load; see the sketch below). Getting another full 2x would also require an optimizer that's aware of Sandybridge-family un-lamination, which clang's obviously isn't. Clang's default of 4 accumulators seems reasonable, and more accumulators would mean more init and cleanup work, although unrolling by 20 with only 4 accumulators seems excessive, like a waste of I-cache / uop-cache footprint.
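    As a hypothetical illustration of higher computational intensity (not part of the original benchmark), a sum of squares does 2 FLOPs per 8-byte load, contractable to one FMA with -ffast-math or -ffp-contract=fast, so it can get closer to peak FLOPS even when data has to come from L2:

    double sum_squares(const double *array, long size) {
      double sum = 0.0;
      for (long i = 0; i < size; ++i)
        sum += array[i] * array[i];   // a multiply and an add per load
      return sum;
    }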


    Perf counter results

    Counting in user-space only on i7-6700k Skylake (EPP=performance) with Linux kernel & perf 6.5. This is for the entire process, including startup, but the inner repeat count of 1 million means the vast majority of its total time is spent in the loop we care about, not startup.

    Scalar loop:
    Performance counter stats for './a.out' (GCC O3-native without fast-math):

         18,902.70 msec task-clock                       #    1.000 CPUs utilized
                54      context-switches                 #    2.857 /sec
                 0      cpu-migrations                   #    0.000 /sec
                72      page-faults                      #    3.809 /sec
    79,099,401,032      cycles                           #    4.185 GHz
    35,069,666,963      instructions                     #    0.44  insn per cycle
    30,109,096,046      uops_issued.any                  #    1.593 G/sec
    50,096,899,159      uops_executed.thread             #    2.650 G/sec
        46,353,551      idq.mite_uops                    #    2.452 M/sec
                 0      fp_arith_inst_retired.256b_packed_double #    0.000 /sec
    
      18.902876984 seconds time elapsed
    
      18.893778000 seconds user
       0.000000000 seconds sys
    

    Note the 0 counts for fp_arith_inst_retired.256b_packed_double - no 256-bit SIMD instructions.

    Vectorized but not unrolled:
    Performance counter stats for './a.out' (GCC O3-native-fast-math):

          4,559.54 msec task-clock                       #    1.000 CPUs utilized
                 8      context-switches                 #    1.755 /sec
                 0      cpu-migrations                   #    0.000 /sec
                74      page-faults                      #   16.230 /sec
    19,093,881,407      cycles                           #    4.188 GHz
    20,060,557,627      instructions                     #    1.05  insn per cycle
    15,094,070,341      uops_issued.any                  #    3.310 G/sec
    20,075,885,996      uops_executed.thread             #    4.403 G/sec
        12,015,692      idq.mite_uops                    #    2.635 M/sec
     5,000,000,000      fp_arith_inst_retired.256b_packed_double #    1.097 G/sec
    
       4.559770793 seconds time elapsed
    
       4.557838000 seconds user
       0.000000000 seconds sys
    

    Vectorized, unrolled by 20 with 4 accumulators:
    Performance counter stats for './a.out': (Clang -O3-native-fast-math JCC-workaround)

          1,174.07 msec task-clock                       #    1.000 CPUs utilized
                 5      context-switches                 #    4.259 /sec
                 0      cpu-migrations                   #    0.000 /sec
                72      page-faults                      #   61.325 /sec
     4,905,130,231      cycles                           #    4.178 GHz
     5,941,872,098      instructions                     #    1.21  insn per cycle
    10,689,939,125      uops_issued.any                  #    9.105 G/sec
    10,566,645,887      uops_executed.thread             #    9.000 G/sec
         5,165,165      idq.mite_uops                    #    4.399 M/sec
     5,015,000,000      fp_arith_inst_retired.256b_packed_double #    4.271 G/sec
    
       1.174507232 seconds time elapsed
    
       1.173769000 seconds user
       0.000000000 seconds sys
    

    Note the slightly higher count of 256-bit vector instructions: that's 3 extra vaddpd to reduce 4 accumulators down to 1 before the horizontal sum down to 1 scalar. (That starts with a vextractf128 of the high half, then uses 128-bit vector instructions, so this counter doesn't include them, but they still compete with the start of the next repeat-iteration's work.)
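    Roughly what that cleanup does, expressed as an intrinsics sketch (the hsum_m256d name is just for illustration; the compiler-generated cleanup is the equivalent asm):

    #include <immintrin.h>

    static double hsum_m256d(__m256d v) {
      __m128d lo = _mm256_castpd256_pd128(v);    // low half, no instruction needed
      __m128d hi = _mm256_extractf128_pd(v, 1);  // vextractf128: high half
      lo = _mm_add_pd(lo, hi);                   // 2 partial sums left
      __m128d high64 = _mm_unpackhi_pd(lo, lo);  // move element 1 down to element 0
      return _mm_cvtsd_f64(_mm_add_sd(lo, high64));
    }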

    Vectorized, unrolled by 16 with 8 accumulators:
    Performance counter stats for './a.out' (clang -O3 native fast-math #pragma ... interleave_count(8)):

            701.30 msec task-clock                       #    0.999 CPUs utilized
                 3      context-switches                 #    4.278 /sec
                 0      cpu-migrations                   #    0.000 /sec
                71      page-faults                      #  101.241 /sec
     2,931,696,392      cycles                           #    4.180 GHz
     6,566,898,298      instructions                     #    2.24  insn per cycle
    11,249,046,508      uops_issued.any                  #   16.040 G/sec
    11,019,891,003      uops_executed.thread             #   15.714 G/sec
         3,153,961      idq.mite_uops                    #    4.497 M/sec
     5,035,000,000      fp_arith_inst_retired.256b_packed_double #    7.180 G/sec
    
       0.701728321 seconds time elapsed
    
       0.701217000 seconds user
       0.000000000 seconds sys
    

    Even more cleanup work after the loop: 7 vaddpd to get down to 1 vector. And not a 2x speedup; instead we bottleneck on 16.040 G uops/sec / 4.180 GHz ~= 3.84 average uops issued per clock, with most cycles issuing Skylake's max of 4. This is because Clang/LLVM doesn't know how to tune for Intel CPUs, using indexed addressing modes. (uops executed is actually lower than uops issued, so there's very little micro-fusion of loads with ALU ops, plus 8 vxorps instructions zeroing the 8 accumulators before each repeat-iteration, which need an issue slot but no back-end execution unit.)

    7.180 / 4.18 GHz = an average of 1.71 256-bit FP instructions executed per clock cycle.

    (The CPU was probably running at 4.20 GHz the whole time, but that frequency is derived from counts for cycles (in user-space only) divided by task-clock. Time spent in the kernel (on page-faults and interrupts) isn't being counted because we used perf stat --all-user)


    Hand-written asm loop:

    Fixing the front-end bottleneck by avoiding indexed addressing modes gets us up to 1.81 vaddpd/clock, up from 1.71. (Not 2.0 because imperfect uop scheduling loses a cycle with no slack to make it up.) That's about 30281.47 MFLOP/s at 4.2 GHz on a single core.

    As a starting point, I used clang -O3 -fno-unroll-loops -S -march=native -ffast-math -Wall arr-sum.c -masm=intel -o arr-sum-asm.S on the C version with the unroll pragma, so that loop was still unrolled with 8 accumulators, but only unrolled by 8 not 16.

    And the outer repeat loop stayed rolled up, so I only had to hand-edit one copy of the asm loop (inlined into main.) The ds prefixes on a couple instructions are to work around the JCC erratum. Note that none of these instructions need disp32 addressing modes since I put the pointer increment in the right place to benefit from the full -0x80 .. +0x7f, actually going from -0x80 to +0x60. So machine-code size is much smaller than clang's, with instruction lengths of 5 (or 4 for [rdi+0]). The add ends up needing an imm32, but there's just one of it. And critically, these stay micro-fused, cutting front-end uop bandwidth almost in half.

        vxorpd  xmm0, xmm0, xmm0
     ...
        vxorpd  xmm7, xmm7, xmm7    # compiler-generated sumvec = 0
        mov     ecx, 4000 / (4*8)   # loop trip-count
        mov    rdi, rbx             # startp = arr
        .p2align        4, 0x90
    .LBB2_7:                #   Parent Loop BB2_5 Depth=1
                            #     Parent Loop BB2_6 Depth=2
                            # =>    This Inner Loop Header: Depth=3
        ds vaddpd ymm0, ymm0, [rdi + 32*0]
        vaddpd  ymm1, ymm1, [rdi + 32*1]
        vaddpd  ymm2, ymm2, [rdi + 32*2]
    ds vaddpd   ymm5, ymm5, [rdi + 32*3]
        add   rdi, 256
        vaddpd  ymm3, ymm3, [rdi - 32*4]
        ds vaddpd   ymm6, ymm6, [rdi - 32*3]
        vaddpd  ymm7, ymm7, [rdi - 32*2]
        vaddpd  ymm4, ymm4, [rdi - 32*1]
        dec     rcx           # not spanning a 32B boundary
        jne     .LBB2_7
    # %bb.8:                                #   in Loop: Header=BB2_6 Depth=2
        vaddpd  ymm0, ymm1, ymm0
        vaddpd  ymm1, ymm5, ymm2
        ... hsum
    
    $ taskset -c 1  perf stat  -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_double   -r1 ./a.out 
    30281.47 MFLOPS (2003.09)
    30057.33 MFLOPS (2003.09)
    30138.64 MFLOPS (2003.09)
    30160.00 MFLOPS (2003.09)
    29979.61 MFLOPS (2003.09)
    
     Performance counter stats for './a.out':
    
                664.79 msec task-clock                       #    0.999 CPUs utilized             
                     3      context-switches                 #    4.513 /sec
                     0      cpu-migrations                   #    0.000 /sec
                    73      page-faults                      #  109.809 /sec
         2,775,830,392      cycles                           #    4.176 GHz
         7,007,878,485      instructions                     #    2.52  insn per cycle
         6,457,154,731      uops_issued.any                  #    9.713 G/sec
        11,378,180,211      uops_executed.thread             #   17.115 G/sec
             3,634,644      idq.mite_uops                    #    5.467 M/sec
         5,035,000,000      fp_arith_inst_retired.256b_packed_double #    7.574 G/sec
    
           0.665220579 seconds time elapsed
    
           0.664698000 seconds user
           0.000000000 seconds sys
    

    uops_issued.any is now about 2.32 / cycle, plenty of headroom for uop-cache fetch and other front-end bottlenecks.

    Unrolling by 10 instead of 8 only brings a tiny speedup, like 662.49 ms total time on a good run, and best MFLOPS 30420.80. IPC around 2.41.

    L1d cache bandwidth is the last bottleneck on SKL before saturating the FP pipes, not quite sustaining 2x 32-byte loads/clock. Changing 3 of the insns to add the register to itself (sum7 += sum7) speeds up the 10-accumulator version to 619.11 ms total time, best MFLOPS 32424.90, 2.58 IPC, average 1.957 256-bit uops/clock. (And the FP ports have to compete with a couple 128-bit adds in the cleanup.)

    Raptor Lake can do 3 loads per clock even for vectors, so that shouldn't be a problem there.