I have the following input:
[1i+2j], [3i+4j], [5i+6j],...
The output should be:
[1i+2j], [0i+0j], [3i+4j], [0i+0j], [5i+6j], [0i+0j],...
I wrote the following code:
#include <arm_neon.h>

void Extract (ComplexFloat *pIn, ComplexFloat *pOut, uint32_t N)
{
    ComplexFloat *pSrc = pIn;
    ComplexFloat *pDst = pOut;
    float32x4_t In, Out;
    float32x2_t HighIn, LowIn;
    float32x2_t Zero = vdup_n_f32 (0);
    // Loop over all input elements, two complex values per iteration
    // (an odd N leaves the last element unprocessed)
    for (uint32_t n = 0; n < (N >> 1); n++)
    {
        In = vld1q_f32((float*)pSrc);
        HighIn = vget_high_f32(In);
        LowIn = vget_low_f32(In);
        // First input element followed by a zero element
        Out = vcombine_f32(LowIn, Zero);
        vst1q_f32((float*)pDst, Out);
        pDst += 2;
        // Second input element followed by a zero element
        Out = vcombine_f32(HighIn, Zero);
        vst1q_f32((float*)pDst, Out);
        pDst += 2;
        pSrc += 2;
    }
}
Can you suggest code with better performance?
Thank you, Zvika
At least the following should give fewer instructions:
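// Each block reads 4 floats (2 complex values) and writes 8 floats.
// blocks must be at least 1, because the loop body runs before the test.
void interleave(float *src, float *dst, int blocks) {
    // val[1] is never overwritten, so it supplies the zeros
    uint64x2x2_t data = { vdupq_n_u64(0), vdupq_n_u64(0) };
    do {
        data.val[0] = vreinterpretq_u64_f32(vld1q_f32(src));
        src += 4;
        vst2q_u64(reinterpret_cast<uint64_t*>(dst), data);
        dst += 8;
    } while (--blocks);
}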
We need to reinterpret each pair of adjacent floats as one wider 64-bit element, after which we can use vst2
to interleave 64 bits at a time with no explicit zip instructions.
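To check the output layout on any machine, the same idea can be wrapped with a scalar fallback. This is only a sketch under a few assumptions: the name interleave_with_zeros and the count parameter (number of complex values) are made up for illustration, a complex value is taken to be two consecutive floats, and the NEON path is guarded with __aarch64__ because, as far as I know, vst2q_u64 is only provided on AArch64.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper, not part of the code above.
// Copies `count` complex values (2 floats each) from src to dst and
// writes a zero complex value after each one.
// src holds 2 * count floats, dst must have room for 4 * count floats.
void interleave_with_zeros(const float *src, float *dst, std::size_t count) {
    std::size_t i = 0;
#if defined(__aarch64__)
    // val[1] stays zero for the whole loop; only val[0] is reloaded.
    uint64x2x2_t data = { { vdupq_n_u64(0), vdupq_n_u64(0) } };
    for (; i + 2 <= count; i += 2) {
        // Treat each (re, im) pair as one 64-bit lane.
        data.val[0] = vreinterpretq_u64_f32(vld1q_f32(src + 2 * i));
        // vst2 stores val[0][0], val[1][0], val[0][1], val[1][1].
        vst2q_u64(reinterpret_cast<uint64_t *>(dst + 4 * i), data);
    }
#endif
    // Scalar path: the odd tail on AArch64, everything on other targets.
    for (; i < count; ++i) {
        dst[4 * i + 0] = src[2 * i + 0];
        dst[4 * i + 1] = src[2 * i + 1];
        dst[4 * i + 2] = 0.0f;
        dst[4 * i + 3] = 0.0f;
    }
}
```

Unlike the do/while version, this accepts a count of zero and an odd count, at the cost of one extra loop for the tail. Whether it is actually faster than the original should be measured on the target core.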