I need to read a sequence of single-precision complex numbers, stored interleaved as [real1, imag1, real2, imag2, ...], into ymm registers and unpack them so that, say, ymm0 contains [real1, real2, real3, ...] and ymm1 contains [imag1, imag2, imag3, ...]. The following code works, but it uses four lane-crossing shuffles. Is there a more efficient way to do this?
// the negatives here stand in for imaginary parts
float _f[] = {1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7, 8, -8};
int i[] = {0, 2, 4, 6, 1, 3, 5, 7};
__m256 a = _mm256_loadu_ps(_f);
__m256 b = _mm256_loadu_ps(_f+8);
__m256i x = _mm256_loadu_si256((const __m256i *)i);
__m256 c = _mm256_permutevar8x32_ps(a, x);
__m256 d = _mm256_permutevar8x32_ps(b, x);
__m256 e = _mm256_permute2f128_ps(c, d, 0x20);
__m256 f = _mm256_permute2f128_ps(c, d, 0x31);
At the end of this sequence, e contains the real parts and f contains the imaginary parts. My only concern is that lane-crossing shuffles can be expensive on some machines.
As suggested in the comment by harold, the following code separates the real and imaginary parts into separate vectors, but the order won't be exactly right. Instead, e will have [real1, real5, real2, real6, ...] and f will have the corresponding imaginary parts. This may be good enough for some applications, so I figured it was worth posting in case anybody else finds it useful.
float _f[] = {1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7, 8, -8};
__m256 a = _mm256_loadu_ps(_f);
__m256 b = _mm256_loadu_ps(_f+8);
__m256 c = _mm256_permute_ps(a, 0xd8);
__m256 d = _mm256_permute_ps(b, 0xd8);
__m256 e = _mm256_unpacklo_ps(c,d);
__m256 f = _mm256_unpackhi_ps(c,d);
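One case where the permuted order is harmless is a reduction, since e and f are permuted identically and the final sum does not depend on element order. The sketch below is only an illustration under assumptions: `sum_norm_sq` and the `TARGET_AVX` macro are hypothetical names that are not part of the original code, and the input length is assumed to be a multiple of 8 complex values.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Assumption: on GCC/Clang, enable AVX for this function only, so the file
   builds without -mavx. With other compilers, enable AVX via compiler flags. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: sum of |z|^2 over n interleaved complex floats
   (2*n floats at z). Assumes n is a multiple of 8. */
TARGET_AVX
float sum_norm_sq(const float *z, size_t n)
{
    assert(n % 8 == 0);
    __m256 acc = _mm256_setzero_ps();
    for (size_t k = 0; k < n; k += 8) {
        __m256 a = _mm256_loadu_ps(z + 2 * k);
        __m256 b = _mm256_loadu_ps(z + 2 * k + 8);
        /* In-lane only: re = [r1 r5 r2 r6 | r3 r7 r4 r8], im in the same order. */
        __m256 c = _mm256_permute_ps(a, 0xd8);
        __m256 d = _mm256_permute_ps(b, 0xd8);
        __m256 re = _mm256_unpacklo_ps(c, d);
        __m256 im = _mm256_unpackhi_ps(c, d);
        /* re and im share the same permutation, so element-wise math lines up. */
        acc = _mm256_add_ps(acc, _mm256_add_ps(_mm256_mul_ps(re, re),
                                               _mm256_mul_ps(im, im)));
    }
    /* Simple horizontal sum: spill the accumulator and add the 8 lanes. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int j = 0; j < 8; ++j)
        sum += lanes[j];
    return sum;
}
```

Calling `sum_norm_sq(_f, 8)` on the 16-float test array from above processes all eight complex values in one loop iteration without any lane-crossing shuffle.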
EDIT: And, as pointed out by Peter Cordes, the following even shorter solution leaves [real1, real2, real5, real6, real3, real4, real7, real8] in c and the corresponding imaginary parts in d.
float _f[] = {1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7, 8, -8};
__m256 a = _mm256_loadu_ps(_f);
__m256 b = _mm256_loadu_ps(_f+8);
__m256 c = _mm256_shuffle_ps(a, b, 0x88);
__m256 d = _mm256_shuffle_ps(a, b, 0xdd);
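If the original order is required, one possible follow-up to the `_mm256_shuffle_ps` version is a single lane-crossing permute per output vector, which brings the count down from four lane-crossing shuffles to two. This is a sketch rather than a measured recommendation: it assumes AVX2 is available (for `_mm256_permute4x64_pd`), and `deinterleave8` and `TARGET_AVX2` are hypothetical names introduced only for illustration.

```c
#include <immintrin.h>

/* Assumption: on GCC/Clang, enable AVX2 for this function only, so the file
   builds without -mavx2. With other compilers, enable AVX2 via compiler flags. */
#if defined(__GNUC__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: split 8 interleaved complex floats (16 floats at z)
   into re[0..7] and im[0..7], preserving the original order. */
TARGET_AVX2
void deinterleave8(const float *z, float *re, float *im)
{
    __m256 a = _mm256_loadu_ps(z);
    __m256 b = _mm256_loadu_ps(z + 8);
    /* In-lane shuffles: c = [r1 r2 r5 r6 | r3 r4 r7 r8], d likewise for imag. */
    __m256 c = _mm256_shuffle_ps(a, b, 0x88);
    __m256 d = _mm256_shuffle_ps(a, b, 0xdd);
    /* Treat each vector as four 64-bit pairs and pick pairs 0, 2, 1, 3,
       which swaps the two middle pairs and restores r1..r8 / i1..i8. */
    __m256d e = _mm256_permute4x64_pd(_mm256_castps_pd(c), 0xd8);
    __m256d f = _mm256_permute4x64_pd(_mm256_castps_pd(d), 0xd8);
    _mm256_storeu_ps(re, _mm256_castpd_ps(e));
    _mm256_storeu_ps(im, _mm256_castpd_ps(f));
}
```

Whether the extra permutes are worth it depends on the application: if every later step is element-wise or a reduction, the shuffled order from the two-instruction version can simply be kept.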