Tags: c++, performance, avx, dot-product, avx512

Simple AVX512 dot-product loop only 10.6x faster, expected 16x


The task is to sum the products of multiplying each float in array A with the corresponding element in array B. The arrays could have tens of thousands of elements, and this must run, say, 100,000 times per second to handle a real-time data stream, so performance is key.

I've coded it using regular math and again with AVX512. It is about 10.6x faster, which is a surprise: I expected 16x or so, given that I'm doing 16 operations per instruction. Furthermore, while the loop has various overhead (looping variable, increments, branch if continuing the loop, etc.), it does only 1/16th as much of that as the naive version.

I'm compiling in Visual Studio 2022 Community, in release mode, and running on an i7-11700F.

Here are the lines of code. I basically step through the two arrays 16 elements at a time, multiply the respective elements, and keep 16 running sums. At the very end of the calculation I use _mm512_reduce_add_ps() to sum those 16 sums.

vector<__m512>      a512In;      // input samples, viewed 16 floats at a time
vector<__m512>      a512IRCurr;  // impulse response, viewed 16 floats at a time
__m512 fOut = _mm512_set1_ps( 0.0f );   // 16 running sums

for ( iSample = 0; iSample < iIterations; iSample++ )
    fOut = _mm512_add_ps( fOut, _mm512_mul_ps( a512In[ iPos++ ],
                                               a512IRCurr[ iSample ] ) );

I see vmovups doesn't assume the target is aligned, and wonder if that's the problem. However, I also see that the unaligned versions have been the same speed as the aligned versions for many generations, though there's a troubling note that latency may still differ: https://community.intel.com/t5/Intel-ISA-Extensions/what-are-the-performance-implications-of-using-vmovups-and/m-p/1143448 While I'm comfortable with machine language of the 6502 variety, I don't know modern Intel.
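
For reference, this is the distinction I mean at the intrinsics level, going through raw float pointers rather than the vector<__m512> indexing above (just an illustration, not the code I'm benchmarking):

#include <immintrin.h>

// _mm512_load_ps requires the pointer to be 64-byte aligned and typically
// compiles to vmovaps; _mm512_loadu_ps makes no alignment promise and
// typically compiles to vmovups.
__m512 LoadAligned( const float* p )   { return _mm512_load_ps( p ); }
__m512 LoadUnaligned( const float* p ) { return _mm512_loadu_ps( p ); }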

I also wonder whether _mm512_add_ps is the right instruction for a = a + b constructs, or whether there's a faster a += b type instruction.
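
For what it's worth, here's the kind of fused form I have in mind, using _mm512_fmadd_ps to fold the multiply and add into a single operation; I haven't measured whether it actually helps this loop:

// Same accumulation as above, but one fused multiply-add per iteration
// instead of separate _mm512_mul_ps and _mm512_add_ps calls.
for ( iSample = 0; iSample < iIterations; iSample++ )
    fOut = _mm512_fmadd_ps( a512In[ iPos++ ], a512IRCurr[ iSample ], fOut );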

    for ( iSample = 0; iSample < iIterations; iSample++ )
00007FF6677B2958  movsxd      r8,edi  
00007FF6677B295B  test        edi,edi  
00007FF6677B295D  jle         Circle2AVR512::ProcessInput+0AEh (07FF6677B298Eh)  
        fOut = _mm512_add_ps( fOut, _mm512_mul_ps( a512In[ iPos++ ],
00007FF6677B295F  movsxd      r9,eax  
00007FF6677B2962  mov         rdx,r11  
00007FF6677B2965  shl         r9,6  
00007FF6677B2969  mov         ecx,edi  
00007FF6677B296B  sub         r9,r11  
00007FF6677B296E  add         r9,qword ptr [r10]  
00007FF6677B2971  vmovups     zmm0,zmmword ptr [r9+rdx]  
00007FF6677B2978  vmulps      zmm1,zmm0,zmmword ptr [rdx]  
00007FF6677B297E  lea         rdx,[rdx+40h]  
00007FF6677B2982  vaddps      zmm2,zmm2,zmm1  
00007FF6677B2988  sub         r8,1  
00007FF6677B298C  jne         Circle2AVR512::ProcessInput+91h (07FF6677B2971h)  
                                                   a512IRCurr[ iSample ] ) );

Solution

  • The TLDR: it's memory speed. My L1 cache supports 16-18x speedup, L2 about 10-12x, and L3 about 4.3x, when each is compared to a naive single-data-per-instruction C++ implementation. Once data no longer fits in L1, my CPU cannot fully utilize the AVX512 instructions' parallelism. That's the narrow answer to my narrow question.

    [Chart: Speed Relative to Circle2]

    The more in-depth answer: I'm sheepish to take credit for answering when I've gotten so many ideas from comments on the post, but here are my findings.

    The program processes input data sample by sample; the input is stored in a circular queue, and for every sample the program multiplies the last N samples of input by the corresponding values in a second array called an impulse response. To be clear, I multiply the newest input sample by cell 0 in the impulse response, the second-newest by cell 1, and so on. The newest input sample, which I multiply by cell 0 this time, will be multiplied by cell 1 instead when I'm called again for the next input sample, and so on. All of these products are summed, and that sum is the output of the function.
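
    In scalar form, the per-sample operation is essentially the following (a simplified sketch that ignores the circular-queue bookkeeping; ProcessSampleNaive and inHistory are illustrative names, with inHistory[0] as the newest sample):

    // Naive per-sample convolution: the newest input sample times ir[0],
    // the second-newest times ir[1], and so on; the sum is the output.
    float ProcessSampleNaive( const float* inHistory,  // inHistory[0] = newest sample
                              const float* ir,         // impulse response
                              int          n )         // length of both arrays
    {
        float sum = 0.0f;
        for ( int i = 0; i < n; i++ )
            sum += inHistory[ i ] * ir[ i ];
        return sum;
    }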

    I made a dozen variations on this dot-product calculation, and use as the baseline one I called "Circle2" (a type of circular queue that doesn't check for wraparound with every operation, but rather is composed of two loops whose starting points and iteration counts are calculated before entering the loops). Circle2AVX512 basically keeps the same data structures but steps by multiples of 16, and casts float*'s to __m512*. Accesses to the input queue happen to still be aligned, but accesses to the impulse response are mostly non-aligned. Circle2AVX512Aligned keeps 16 copies of the impulse response, one at each possible memory alignment. Accesses are again always aligned on the circular queue, but now also aligned on the impulse response, by selecting the edition of the impulse response with the desired alignment.
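
    To make the alignment trick concrete, here's a sketch of the idea behind those 16 copies; it's a hypothetical illustration (BuildAlignedIRCopies is an invented name), not the actual Circle2AVX512Aligned code:

    #include <immintrin.h>
    #include <cstring>

    // Copy k of the impulse response is stored k floats past a 64-byte boundary,
    // so to start reading the impulse response at element j with aligned loads
    // you pick the copy with k = (16 - j % 16) % 16, which puts element j exactly
    // on a 64-byte boundary.
    float** BuildAlignedIRCopies( const float* ir, int n )
    {
        float** copies = new float*[ 16 ];
        for ( int k = 0; k < 16; k++ )
        {
            // 64-byte-aligned block with headroom for the k-float shift.
            float* block = (float*)_mm_malloc( ( n + 16 ) * sizeof( float ), 64 );
            std::memset( block, 0, ( n + 16 ) * sizeof( float ) );
            std::memcpy( block + k, ir, n * sizeof( float ) );  // shift copy k by k floats
            copies[ k ] = block + k;                            // copies[k][i] == ir[i]
        }
        return copies;   // note: each underlying allocation starts at copies[k] - k
    }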

    My CPU, the i7-11700F, has 32 KB of L1D (D = data) cache and 256 KB of L2 cache per core. Then, a 16 MB L3 cache serves the whole CPU. (I was web-surfing, youtubing and word processing, so not all of this was necessarily available to the test.)

    Looking at Circle2AVX512, my single-threaded app was able to get a 16-18x speedup over a naively-programmed (though optimized) C++ implementation, as long as the data (input buffer plus impulse response) is big enough to overcome the fixed overhead but small enough to fit in L1: 32 KB total, in other words 16 KB for each of the two structures, or 4000 samples each in 4-byte floats.

    Then performance fell to a lower plateau of a 10-12x improvement over naive single-data math, until the two structures grew past 256 KB, at which point L2 was no longer sufficient.

    Past that point it maintains a steady 8.4x to 9.0x speedup over the naive code, at least up to 6.4 MB (two 3.2 MB data structures).

    A future direction of research would be to see how it performs once the data no longer fits in the L3 cache.

    Note that the memory access pattern is the classic worst case for an MRU cache: linear access, mockingly dropping data from the cache just before you're about to finally get back to it. However, as a direction for further work, the software could also take advantage of the classic best case for an MRU cache: extreme locality of reference. If the software were presented with multiple input samples in a group, this could be highly optimized by using the same ranges of both the input queue and the impulse response for the whole group before moving on, achieving extremely high locality of reference. Calculating the first sample of the group would still have to load all the data into cache, but the following samples could free-ride off that loaded cache.
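
    As a rough sketch of that direction (hypothetical code, assuming the caller hands over a whole block of samples and a flat, non-circular history buffer; ProcessBlockNaive is an invented name), the idea is to walk the impulse response once and update every output in the block while each stretch of data is still hot in cache:

    // x[t] is the input at time t (with at least n-1 samples of history before t0),
    // y[m] receives the output for time t0 + m, and ir is the impulse response of length n.
    void ProcessBlockNaive( const float* x, const float* ir, int n,
                            int t0, float* y, int blockSize )
    {
        for ( int m = 0; m < blockSize; m++ )
            y[ m ] = 0.0f;

        for ( int i = 0; i < n; i++ )              // walk the impulse response once
        {
            const float h = ir[ i ];
            for ( int m = 0; m < blockSize; m++ )  // reuse ir[i] (and a small contiguous
                y[ m ] += x[ t0 + m - i ] * h;     // window of x) for every output in the block
        }
    }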

    Frankly, I don't understand why the Circle2AVX512Aligned algo performs as well as we see here. Instead of one copy of the impulse response, it has 16 completely separate copies. So it should run out of L1 cache at about 3.75k per data structure (again, the input queue is the same size in bytes as the impulse response; there are simply 16 impulse responses instead of one). And indeed, that's where we see it switch from significantly out-performing the non-aligned version to underperforming it. And yet I'm still confused why, on this modern CPU that sources tell me should access unaligned memory just as fast, the aligned version is so much faster. I'm also confused why the aligned version doesn't get far slower than it does. My only guess is that the input queue stays nailed to the cache, so at least we have that going for us.

    Likewise, Circle2AVX512Aligned no longer fits into L2 cache when the individual structures reach 256k/17 = 9.17k each. OK, a gap does open up at that point, but not as big a one as I'd have expected.

    Finally, Circle2AVX512Aligned no longer fits into L3 cache when the individual structures reach 16M/17 = 963k each, or so. And this time, the degradation is much clearer.