Tags: arm, simd, intrinsics, neon

ARM NEON: Regular C code is faster than ARM Neon code in simple multiplication?


I am implementing a simple multiplication over an array using ARM NEON intrinsics. The input is a uint8_t array and the output is a uint16_t array. However, the regular scalar code is faster than the NEON-optimized one. Can anyone help me figure out how to improve the NEON code?

My regular code is

    uint16_t scale_factor = 300;
    for(int i = 0; i < output_size; i++)
    {
        out_16bit[i] = (uint16_t)(in_ptr[i] * scale_factor);
    }

My NEON code is

    uint16_t* out_ptr = out_16bit;
    uint8_t* in_ptr = in_8bit;
    uint16_t scale_factor = 300;

    for(int i = 0; i < out_size/16; i++)
    {
        uint8x16_t in_v0 = vld1q_u8(in_ptr);
        in_ptr += 16;

        uint16x8_t in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
        uint16x8_t in_16_v1 = vmovl_u8(vget_high_u8(in_v0));

        uint16x8_t res_0 = vmulq_n_u16(in_16_v0, scale_factor);
        uint16x8_t res_1 = vmulq_n_u16(in_16_v1, scale_factor);

        // code below takes long time
        vst1q_u16(out_ptr, res_0);
        vst1q_u16(out_ptr+8, res_1);
        out_ptr += 16;
    }

I also did some profiling and found that if I comment out either of the vst1q_u16 calls or the out_ptr += 16, the loop is fast, but with both kept as above it is very slow. So I guess the pointer increment might be waiting for the vst1q_u16 stores to finish? I then restructured the NEON code to put some work between the stores and out_ptr += 16, as below:

    uint8x16_t in_v0 = vld1q_u8(in_ptr);
    in_ptr += 16;
    uint16x8_t in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
    uint16x8_t in_16_v1 = vmovl_u8(vget_high_u8(in_v0));

    uint16x8_t res_0 = vmulq_n_u16(in_16_v0, scale_factor);
    uint16x8_t res_1 = vmulq_n_u16(in_16_v1, scale_factor);
    vst1q_u16(out_ptr, res_0);
    vst1q_u16(out_ptr+8, res_1);
    for(int i = 1; i < out_size/16; i++)
    {
        in_v0 = vld1q_u8(in_ptr);
        in_ptr += 16;
        in_16_v0 = vmovl_u8(vget_low_u8(in_v0));
        in_16_v1 = vmovl_u8(vget_high_u8(in_v0));

        out_ptr += 16;

        res_0 = vmulq_n_u16(in_16_v0, scale_factor);
        res_1 = vmulq_n_u16(in_16_v1, scale_factor);

        vst1q_u16(out_ptr, res_0);
        vst1q_u16(out_ptr+8, res_1);
    }

But this change didn't help. Please advise what I should do. Thank you.


Solution

  • The simple answer, as in the comments, is auto-vectorization. I'm not sure about clang 6, but more recent versions of clang will auto-vectorize to Neon by default when targeting Neon platforms, and it is very hard to beat that auto-vectorization on something as simple as this multiplication. Perhaps you could win with the best loop unrolling for your particular processor, but it is very easy to end up worse than the auto-vectorized code. Godbolt is a very good way to compare the generated assembly, along with profiling all your changes.

    All the comments make good points too.

    For more documentation on best practices for Neon intrinsics, Arm's Neon microsite has very useful information, especially the document on optimizing C code with Neon intrinsics.