I have found some solutions where each AVX2 register holds both the real and the imaginary parts of the complex numbers. I am interested in a solution where each AVX2 register holds either the real parts or the imaginary parts.
Assume we have 4 AVX2 registers: R1, I1, R2, I2.
Registers R1, I1 form 4 complex numbers. The same applies to the remaining two registers. Now I want to multiply the 4 complex numbers of R1, I1 with the 4 complex numbers of R2, I2. What would be the most efficient way to do this? Besides AVX2, FMA3 can be used as well.
You wrote that you have AVX2, and all Intel and AMD AVX2 processors also support FMA3. For this reason, I would do it like this:
#include <immintrin.h>

// 4 FP64 complex numbers stored in 2 AVX vectors,
// de-interleaved into real and imaginary vectors
struct Complex4
{
    __m256d r, i;
};
// Multiply 4 complex numbers by another 4 numbers
Complex4 mul4( Complex4 a, Complex4 b )
{
    Complex4 prod;
    // Partial products that use the real parts of a
    prod.r = _mm256_mul_pd( a.r, b.r );
    prod.i = _mm256_mul_pd( a.r, b.i );
    // real = a.r * b.r - a.i * b.i
    prod.r = _mm256_fnmadd_pd( a.i, b.i, prod.r );
    // imag = a.r * b.i + a.i * b.r
    prod.i = _mm256_fmadd_pd( a.i, b.r, prod.i );
    return prod;
}
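In case it helps to see why four instructions are enough, here is the usual expansion of a complex product, written per lane. The symbols $a_r, a_i, b_r, b_i$ are my own shorthand for one lane of `a.r`, `a.i`, `b.r`, `b.i`:

$$(a_r + i\,a_i)(b_r + i\,b_i) = (a_r b_r - a_i b_i) + i\,(a_r b_i + a_i b_r)$$

`_mm256_fnmadd_pd( x, y, z )` computes $-(x \cdot y) + z$ and `_mm256_fmadd_pd( x, y, z )` computes $x \cdot y + z$, so the two multiplies produce $a_r b_r$ and $a_r b_i$, and the two fused operations then subtract $a_i b_i$ from the real part and add $a_i b_r$ to the imaginary part.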
If you are targeting that one VIA processor which doesn’t have FMA, replace the two FMA intrinsics with the following lines:
prod.r = _mm256_sub_pd( prod.r, _mm256_mul_pd( a.i, b.i ) );
prod.i = _mm256_add_pd( prod.i, _mm256_mul_pd( a.i, b.r ) );
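If the real and imaginary parts already live in separate arrays, the FMA version of `mul4` can be applied to them four elements at a time. The following is only a sketch under a few assumptions of mine: the helper name `mulSplitArrays` and the `AVX2_FMA` macro are made up for this example, the element count is a multiple of 4, and the compiler is GCC or Clang (the `target` attribute is specific to those; with other compilers, enable AVX2 and FMA through the build flags instead).

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Assumption: GCC or Clang on x86-64. The attribute enables AVX2 + FMA for the
// marked functions only; with other compilers, use the build flags instead.
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_FMA __attribute__((target("avx2,fma")))
#else
#define AVX2_FMA
#endif

// 4 FP64 complex numbers, de-interleaved into real and imaginary vectors
struct Complex4
{
    __m256d r, i;
};

// Same arithmetic as the FMA version of mul4 from the answer
AVX2_FMA static inline Complex4 mul4( Complex4 a, Complex4 b )
{
    Complex4 prod;
    prod.r = _mm256_mul_pd( a.r, b.r );
    prod.i = _mm256_mul_pd( a.r, b.i );
    prod.r = _mm256_fnmadd_pd( a.i, b.i, prod.r );  // a.r*b.r - a.i*b.i
    prod.i = _mm256_fmadd_pd( a.i, b.r, prod.i );   // a.r*b.i + a.i*b.r
    return prod;
}

// Hypothetical helper: out[k] = a[k] * b[k] for n complex numbers stored as
// split arrays (one array of real parts, one array of imaginary parts).
AVX2_FMA void mulSplitArrays( const double* ar, const double* ai,
                              const double* br, const double* bi,
                              double* outR, double* outI, std::size_t n )
{
    // This sketch has no scalar tail loop, so n must be a multiple of 4
    assert( n % 4 == 0 );
    for( std::size_t k = 0; k < n; k += 4 )
    {
        // Unaligned loads, so the arrays need no special alignment
        Complex4 a{ _mm256_loadu_pd( ar + k ), _mm256_loadu_pd( ai + k ) };
        Complex4 b{ _mm256_loadu_pd( br + k ), _mm256_loadu_pd( bi + k ) };
        Complex4 p = mul4( a, b );
        _mm256_storeu_pd( outR + k, p.r );
        _mm256_storeu_pd( outI + k, p.i );
    }
}
```

Because the data is already split, the loop needs no shuffles or permutes at all, which is the main attraction of this layout compared to the interleaved one. The code still requires a CPU with AVX2 and FMA3 at run time.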