ARM Intrinsic: Insert complex zero after each complex float sample

I have the following input:

[1i+2j], [3i+4j], [5i+6j],...

The output should be:

[1i+2j], [0i+0j], [3i+4j], [0i+0j], [5i+6j], [0i+0j],...

I wrote the following code:

void Extract (ComplexFloat *pIn, ComplexFloat*pOut, uint32_t N)
{
    ComplexFloat* pSrc = pIn;
    ComplexFloat *pDst = (ComplexFloat*)pOut;
    float32x2_t Zero;
    float32x4_t In, Out;
    float32x2_t HighIn, LowIn;

    Zero = vdup_n_f32 (0);

    //Loop on all input elements 
    for (int n = 0; n < N >> 1; n++)
    {
        In = vld1q_f32((float*)pSrc);
        HighIn = vget_high_f32(In);
        LowIn = vget_low_f32(In);
        Out = vcombine_f32(LowIn, Zero);
        vst1q_f32((float*)pDst, Out);
        pDst += 2;
        Out = vcombine_f32(HighIn, Zero);
        vst1q_f32((float*)pDst, Out);
        pDst += 2;
        pSrc += 2;
    }

}

Can you suggest code with better performance?

Thank you, Zvika

Solution

At least the following should give fewer instructions:

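// Each block reads 2 complex samples (4 floats) and writes 8 floats.
// C++ (uses reinterpret_cast); needs <arm_neon.h> and blocks >= 1.
void interleave(const float *src, float *dst, int blocks) {
  // val[1] stays zero: it supplies the complex zero after each sample.
  uint64x2x2_t data = { vdupq_n_u64(0), vdupq_n_u64(0) };
  do {
     // View each (re, im) float pair as one 64-bit lane.
     data.val[0] = vreinterpretq_u64_f32(vld1q_f32(src)); src += 4;
     // vst2 interleaves the lanes: sample0, zero, sample1, zero.
     vst2q_u64(reinterpret_cast<uint64_t*>(dst), data); dst += 8;
  } while (--blocks);
}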

The idea is to reinterpret each pair of adjacent floats (one complex sample) as a single 64-bit element; vst2 can then interleave the data with zeros 64 bits at a time, with no explicit zip instructions.
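
As a rough sketch of how this could be wrapped for arbitrary sample counts, the version below uses the same 64-bit interleaving store for pairs of samples and a scalar loop for an odd trailing sample. The name insert_complex_zeros and the samples parameter are hypothetical and not part of the answer. The guard assumes vst2q_u64 is only available on AArch64 targets; on anything else the scalar loop does all the work, which also makes it usable as a reference when checking results.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the answer): copies each complex sample
// (re, im) and follows it with a complex zero (0, 0).
// `samples` counts complex values; dst must hold 4 * samples floats.
void insert_complex_zeros(const float *src, float *dst, std::size_t samples) {
  std::size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
  // NEON path (assumed AArch64-only): two samples per iteration.
  // val[1] is never modified, so it provides the zeros.
  uint64x2x2_t data = { vdupq_n_u64(0), vdupq_n_u64(0) };
  for (; i + 2 <= samples; i += 2) {
    // Treat each (re, im) float pair as one 64-bit lane.
    data.val[0] = vreinterpretq_u64_f32(vld1q_f32(src + 2 * i));
    // Interleaving store writes: sample i, zero, sample i+1, zero.
    vst2q_u64(reinterpret_cast<uint64_t *>(dst + 4 * i), data);
  }
#endif
  // Scalar path: the odd tail sample, or everything when NEON is absent.
  for (; i < samples; ++i) {
    dst[4 * i + 0] = src[2 * i + 0];  // real part
    dst[4 * i + 1] = src[2 * i + 1];  // imaginary part
    dst[4 * i + 2] = 0.0f;            // inserted zero, real part
    dst[4 * i + 3] = 0.0f;            // inserted zero, imaginary part
  }
}
```

For the input from the question, calling insert_complex_zeros(in, out, 3) with in = {1, 2, 3, 4, 5, 6} should fill out with {1, 2, 0, 0, 3, 4, 0, 0, 5, 6, 0, 0}. Whether this is actually faster than the original loop depends on the core, so it is worth measuring on the target hardware.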