
AVX2: What is the best way to multiply and sum 4 complex values with 4 double values?


One big hotspot in EM simulation within the xnec2c project takes this form, and the same form is repeated throughout the calculation in various ways:

*dst += (*a1) * (*a2) + (*b1) * (*b2) + (*c1) * (*c2);  // + (*d1) * (*d2)

(A real example of this appears in matrix.c in the xnec2c repository on GitHub.)

Typically a1, b1, c1 are complex doubles. Sometimes a2, b2, c2 are complex, other times they are real doubles. (I commented out d1, d2 because, while they don't exist in the EM simulation, I'm guessing we get them for free in the AVX calculation, so the code below is written for 4 doubles.)

Below is a modified version of Peter Cordes's answer, intended to:

  1. Multiply 4 complex numbers against 4 doubles as above
  2. Sum the results.
  3. Add the sum from #2 to *dst

#include <complex.h>
#include <immintrin.h>

static inline void avx_fma_4zd4d_sum(double complex * restrict dst,
  const double complex * restrict A,
  double *b0, double *b1, double *b2, double *b3)
{
    // low element first, little-endian style
    __m256d A0 = _mm256_loadu_pd((const double *) A);       // [ A0r A0i A1r A1i ]
    __m256d A2 = _mm256_loadu_pd((const double *) (A + 2)); // [ A2r A2i A3r A3i ]

    // Q: b0-b3 are spread out in memory. Is this the best way to load them?
    // (args are high-to-low because _mm256_set_pd takes elements in reverse order)
    __m256d B0 = _mm256_set_pd(*b1, *b1, *b0, *b0);
    __m256d B2 = _mm256_set_pd(*b3, *b3, *b2, *b2);

    // complex * real: desired real = rA*rB, imag = iA*rB
    A0 = _mm256_mul_pd(A0, B0);
    A2 = _mm256_mul_pd(A2, B2);

    // sum the two vectors of products
    A0 = _mm256_add_pd(A0, A2);

    // Q: Now A0 has 2x complex doubles in a single vector.  How do I sum them?

    // Q: Once I do, how do I add the result to dst?

    //_mm256_storeu_pd((double *) dst, A0);
    //_mm256_storeu_pd((double *) (dst + 2), A2);
}

You can see my questions in the comments above. I was looking at this answer, but it sums a vector of plain doubles; mine holds complex doubles laid out as [A0r, A0i, A1r, A1i], so the sums need to interleave. The result would be the following if I dropped out of intrinsics, but I'd like to understand the intrinsics detail for this operation:

dst += (A0[0]+A0[2]) + (A0[1]+A0[3]) * I;

Note that *dst is not in memory near *A so it will need to be a separate load.
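
Here is my untested guess at that horizontal step, adapting the vextractf128 idea from the all-doubles answer; I'm not sure it's the optimal sequence:

// my guess at the horizontal sum, continuing from A0 above (untested)
__m128d lo  = _mm256_castpd256_pd128(A0);   // low complex element
__m128d hi  = _mm256_extractf128_pd(A0, 1); // high complex element
__m128d sum = _mm_add_pd(lo, hi);           // [ real sum, imag sum ]

// *dst is elsewhere in memory, so it needs its own load/add/store
__m128d d = _mm_loadu_pd((double *) dst);
_mm_storeu_pd((double *) dst, _mm_add_pd(d, sum));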

I don't have FMA hardware, but I would like it to compile such that gcc/clang will reduce to FMA if available.

Related Questions:

  1. What if a2, b2, c2, d2 are complex? (named b0-b3 in the vector function)
    • Is summing to dst easier when they are all complex values?
  2. I can typically put a1, b1, c1 in the same vector, but a2, b2, c2 are all over memory. Is there a better way to load them than using _mm256_set_pd?

Thanks for your help!


Solution

  • Here are some pieces you're going to need. Untested; a sketch putting them together follows the helpers.

    #include <immintrin.h>

    // 4 FP64 complex numbers in 2 AVX SIMD vectors
    typedef struct Complex4 { __m256d vec1, vec2; } Complex4;
    
    // Change the macro to 1 if you have FMA and are optimizing for Intel
    #define USE_FMA 0
    
    Complex4 add( Complex4 a, Complex4 b )
    {
        Complex4 res;
        res.vec1 = _mm256_add_pd( a.vec1, b.vec1 );
        res.vec2 = _mm256_add_pd( a.vec2, b.vec2 );
        return res;
    }
    
    // Multiply vectors component-wise.
    // This is not a complex multiplication, just 2 vmulpd instructions
    Complex4 mul_cwise( Complex4 a, Complex4 b )
    {
        Complex4 res;
        res.vec1 = _mm256_mul_pd( a.vec1, b.vec1 );
        res.vec2 = _mm256_mul_pd( a.vec2, b.vec2 );
        return res;
    }
    
    // Multiply + accumulate vectors component-wise
    // This is not a complex multiplication.
    Complex4 fma_cwise( Complex4 a, Complex4 b, Complex4 acc )
    {
    #if USE_FMA
        Complex4 res;
        res.vec1 = _mm256_fmadd_pd( a.vec1, b.vec1, acc.vec1 );
        res.vec2 = _mm256_fmadd_pd( a.vec2, b.vec2, acc.vec2 );
        return res;
    #else
        return add( mul_cwise( a, b ), acc );
    #endif
    }
    
    // Load 4 scalars from memory, and duplicate them into 2 AVX vectors
    // This is for the complex * real use case
    Complex4 load_duplicating( double* rsi )
    {
        Complex4 res;
    #ifdef __AVX2__
        __m256d tmp = _mm256_loadu_pd( rsi );
        res.vec1 = _mm256_permute4x64_pd( tmp, _MM_SHUFFLE( 1, 1, 0, 0 ) );
        res.vec2 = _mm256_permute4x64_pd( tmp, _MM_SHUFFLE( 3, 3, 2, 2 ) );
    #else
        res.vec1 = _mm256_setr_m128d( _mm_loaddup_pd( rsi ), _mm_loaddup_pd( rsi + 1 ) );
        res.vec2 = _mm256_setr_m128d( _mm_loaddup_pd( rsi + 2 ), _mm_loaddup_pd( rsi + 3 ) );
    #endif
        return res;
    }
    
    // Load 4 complex numbers
    Complex4 load_c4( double* rsi )
    {
        Complex4 res;
        res.vec1 = _mm256_loadu_pd( rsi );
        res.vec2 = _mm256_loadu_pd( rsi + 4 );
        return res;
    }
    
    // Multiply 2 complex numbers by another 2 complex numbers
    __m256d mul_CxC_2( __m256d a, __m256d b )
    {
        // The formula is [ a.x*b.x - a.y*b.y,  a.x*b.y + a.y*b.x ]
        // Computing it as a.xy * b.xx -+ a.yx * b.yy
    
        // note: _mm256_permute_pd takes one control bit per output element (4 bits
        // total), so _MM_SHUFFLE2 would only set the low 128-bit lane correctly
        __m256d ayx = _mm256_permute_pd( a, 0b0101 ); // a.yx in both lanes
        __m256d byy = _mm256_permute_pd( b, 0b1111 ); // b.yy in both lanes
        __m256d p1 = _mm256_mul_pd( ayx, byy ); // a.yx * b.yy
    
        __m256d bxx = _mm256_permute_pd( b, 0b0000 ); // b.xx in both lanes
    
    #if USE_FMA
        return _mm256_fmaddsub_pd( a, bxx, p1 );
    #else
        return _mm256_addsub_pd( _mm256_mul_pd( a, bxx ), p1 );
    #endif
    }
    

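    To show how these pieces might fit together, here is an untested sketch of the complex * real version of your hotspot, including the horizontal reduction into *dst that you asked about. The reduce_add helper and the wrapper function are mine, for illustration, not something from xnec2c:

    // Horizontally reduce an accumulator of 4 complex numbers to 1,
    // leaving [ sum.real, sum.imag ] in an SSE vector
    __m128d reduce_add( Complex4 acc )
    {
        __m256d v = _mm256_add_pd( acc.vec1, acc.vec2 );  // 4 -> 2 complex
        __m128d low = _mm256_castpd256_pd128( v );        // complex #0
        __m128d high = _mm256_extractf128_pd( v, 1 );     // complex #1
        return _mm_add_pd( low, high );                   // 2 -> 1 complex
    }

    // *dst += A[0]*b0 + A[1]*b1 + A[2]*b2 + A[3]*b3
    // A points at 4 complex doubles, b0-b3 at 4 scattered real doubles,
    // dst at 1 complex double
    void fma_4zd4d_sum( double* dst, double* A,
        double* b0, double* b1, double* b2, double* b3 )
    {
        Complex4 a = load_c4( A );
        // b0-b3 are scattered, so there's no single load;
        // duplicate each real into one complex slot
        Complex4 b;
        b.vec1 = _mm256_setr_pd( *b0, *b0, *b1, *b1 );
        b.vec2 = _mm256_setr_pd( *b2, *b2, *b3, *b3 );

        __m128d sum = reduce_add( mul_cwise( a, b ) );
        // dst is far from A: separate load, one final store
        __m128d d = _mm_loadu_pd( dst );
        _mm_storeu_pd( dst, _mm_add_pd( d, sum ) );
    }

    For the all-complex case from your related question #1, replace mul_cwise with two mul_CxC_2 calls, one per vector; the reduction into dst stays the same.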
    Couple more notes.

    Consider C++ for code like this. For one, operators and overloaded functions help. Also, your whole hotspot can be written in a few lines with the Eigen library, with performance often comparable to hand-written intrinsics.

    Use compiler flags or GCC-specific function attributes to make sure all of these functions are always inlined.
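
    For instance (GCC/clang attribute syntax; the FORCE_INLINE name is my own convention):

    // ask the compiler to inline a helper even at lower optimization levels
    #define FORCE_INLINE static inline __attribute__((always_inline))

    FORCE_INLINE Complex4 add( Complex4 a, Complex4 b )
    {
        Complex4 res;
        res.vec1 = _mm256_add_pd( a.vec1, b.vec1 );
        res.vec2 = _mm256_add_pd( a.vec2, b.vec2 );
        return res;
    }

    Also compile the translation unit with -O2 and -mavx (or -mavx2 -mfma where available), since the intrinsics above require the corresponding target features.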

    "Add the sum from #2 to *dst"

    It’s never a good idea to store intermediate values for code like that. If you have 6 things to multiply/add on input, compute them all, and only store the result once. Memory access is inefficient in general, read-after-write especially so.

    "such that gcc/clang will reduce to FMA if available"

    Reducing things to FMA is usually a good idea on Intel, but not necessarily on AMD. AMD chips have twice the throughput for a 50% additions / 50% multiplications instruction mix compared to 100% FMAs: unlike Intel, on AMD floating-point additions and multiplications don't compete for the same execution units. By "throughput" I mean instructions/second; marketing people decided that 1 FMA operation = 2 floating-point operations, so the FLOPS number is the same.