Manual vectorization using AVX vector intrinsics only runs about the same speed as 4 scalar FP adds on Ryzen?

So I decided to take a look at how to use SSE, AVX, ... in C via Intel® Intrinsics. Not because of any actual interest to use it for something, but out of pure curiosity. Trying to check if code using AVX is actually faster than non-AVX code, I was a bit surprised by the results. Here is my C code:

#include <stdio.h>
#include <stdlib.h>

#include <emmintrin.h>
#include <immintrin.h>


/*** Sum up two vectors using AVX ***/
#define __vec_sum_4d_d64(src_vec1, src_vec2, dst_vec) \
  _mm256_store_pd(dst_vec, _mm256_add_pd(_mm256_load_pd(src_vec1), _mm256_load_pd(src_vec2)));

/*** Sum up two vectors without AVX ***/
#define __vec_sum_4d(src_vec1, src_vec2, dst_vec) \
  dst_vec[0] = src_vec1[0] + src_vec2[0];\
  dst_vec[1] = src_vec1[1] + src_vec2[1];\
  dst_vec[2] = src_vec1[2] + src_vec2[2];\
  dst_vec[3] = src_vec1[3] + src_vec2[3];


int main (int argc, char *argv[]) {
  unsigned long i;

  double dvec1[4] = {atof(argv[1]), atof(argv[2]), atof(argv[3]), atof(argv[4])};
  double dvec2[4] = {atof(argv[5]), atof(argv[6]), atof(argv[7]), atof(argv[8])}; 

#if 1
  for (i = 0; i < 3000000000; i++) {
    __vec_sum_4d(dvec1, dvec2, dvec2);
  }
#endif
#if 0
  for (i = 0; i < 3000000000; i++) {
    __vec_sum_4d_d64(dvec1, dvec2, dvec2);
  }
#endif

  printf("%10.10lf %10.10lf %10.10lf %10.10lf\n", dvec2[0], dvec2[1], dvec2[2], dvec2[3]);
}

I simply switch #if 1 to #if 0 and the other way around to switch between "modes" (AVX and non-AVX). My expectation would be, that the loop using AVX would be at least somewhat faster than the other one, but it isn't. I compiled the code with gcc version 10.2.0 (GCC) and these: -O2 --std=gnu99 -lm -mavx2 flags.

> time ./noavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000

real    0m2.150s
user    0m2.147s
sys 0m0.000s

> time ./withavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000

real    0m2.168s
user    0m2.165s
sys 0m0.000s

As you can see, they run at practically the same speed. I also tried to increase the number of iterations by a factor of ten, but the results will simply scale up proportionally. Also note that the printed output values are the same for both executables, so I think that it is save to say that both perform the same calculations. Digging deeper i took a look at the assembly and was even more confused. Here are the important parts of both (only the loop):

; With avx
1070:   c5 fd 58 c1             vaddpd %ymm1,%ymm0,%ymm0
1074:   48 83 e8 01             sub    $0x1,%rax
1078:   75 f6                   jne    1070

; Without avx
1080:   c5 fb 58 c4             vaddsd %xmm4,%xmm0,%xmm0
1084:   c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
1088:   c5 eb 58 d7             vaddsd %xmm7,%xmm2,%xmm2
108c:   c5 e3 58 de             vaddsd %xmm6,%xmm3,%xmm3
1090:   48 83 e8 01             sub    $0x1,%rax
1094:   75 ea                   jne    1080

In my understanding the second one should be way slower since besides decrementing the counter and the conditional jump there are four times as many instructions in it. Why is it not slower? Is the vaddsd instruction just four times faster than vaddpd?

If this is relevant, my system runs on a AMD Ryzen 5 2600X Six-Core Processor which supports AVX.

Solution

With AVX

; With avx
1070:   c5 fd 58 c1             vaddpd %ymm1,%ymm0,%ymm0
1074:   48 83 e8 01             sub    $0x1,%rax
1078:   75 f6                   jne    1070

This loop is using ymm0 as accumulator. In other words it is doing ymm0 += ymm1 (this is a vector operation; adding 4 double values at once). Therefore it has loop-carried dependency on ymm0 (every new addition has to wait for the previous addition to finish and uses the result to start the next addition). vaddpd has latency=3, throughput=1 for Zen+ (according to https://www.uops.info/table.html). Loop carried dependency makes this loop bottleneck on latency of vaddpd, so your loop can get at best 3 cycles/iteration. Only one vaddpd addition is in-flight in the CPU, which is under-utilizing it's capability by a lot.

To make this faster add more accumulators (have more vectors to sum). It can (in theory) get 3 times faster due to pipelining (3 full ymm additions in-flight), as long as it does not get limited by something else.

Without AVX

; Without avx
1080:   c5 fb 58 c4             vaddsd %xmm4,%xmm0,%xmm0
1084:   c5 f3 58 cd             vaddsd %xmm5,%xmm1,%xmm1
1088:   c5 eb 58 d7             vaddsd %xmm7,%xmm2,%xmm2
108c:   c5 e3 58 de             vaddsd %xmm6,%xmm3,%xmm3
1090:   48 83 e8 01             sub    $0x1,%rax
1094:   75 ea                   jne    1080

This loop accumulates results into 4 different accumulators. Basically it is doing:

xmm0 += xmm4
xmm1 += xmm5
xmm2 += xmm7
xmm3 += xmm6

All of these additions are independent from each other (and they are scalar additions, so each only operates on a single 64-bit floating point value). vaddsd has latency=3, throughput=0.5 (Cycles Per Instruction). Which means that it can start executing first 2 additions in one cycle. Then on the next cycle it will start the second pair of additions. Therefore it is possible to achieve 2 cycles/iteration for this loop based on throughput. But latency, as you recall is 3 cycles. So this loop is also bottlenecked on latency. Unroll once (with 4 additional accumulators; alternatively break loop-carried dep.chain within the loop by adding xmm4-7 between each other before adding it to the main accumulator) to get rid of that bottleneck (it may get ~50% faster).

Note that this ("without AVX") disassembly is still using VEX encoding, so technically still requires AVX-capable CPU.

On Benchmarking

Note that your disassembly does not have any loads or stores, so this may or may not be representative of performance comparison for adding 2 arrays of 4-double vectors.