gcc arm simd neon

Aligned load/store with NEON intrinsics in GCC


How can you make GCC generate load/store instructions for aligned access?

If we have something like:

uint8_t* p;
uint8x8x4_t r = vld4_u8(p);

How can you make GCC generate a load instruction that requires 32-byte alignment?


Solution

  • I think you can use __builtin_assume_aligned(ptr, align), which returns ptr and lets GCC assume the result is aligned to at least align bytes.

    e.g.

    #include <arm_neon.h>
    
    void blend4(uint8_t *src, uint8_t *dst)
    {
        uint8_t *aligned_src = __builtin_assume_aligned(src, 16);
        uint8_t *aligned_dst = __builtin_assume_aligned(dst, 16);
        uint8x8x4_t temp = vld4_u8(aligned_src);
        vst4_u8(aligned_dst, temp);
    }
    

    Generates (the :128 qualifier is in bits, so it corresponds to the 16-byte alignment promised above):

    vld4.8  {d16-d19}, [r0:128]
    vst4.8  {d16-d19}, [r1:128]
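
    For the 32-byte case asked about in the question, the same approach should carry over by passing 32 instead of 16. What follows is a minimal sketch, not verified compiler output: on 32-bit ARM I would expect the qualifier to become :256, but check the generated assembly for your GCC version. The names copy32_aligned and alloc32 are hypothetical helpers introduced only for illustration. The NEON path is guarded so the file still builds on other targets, where it falls back to memcpy. The assert is a debug-only check, because __builtin_assume_aligned is an unchecked promise and breaking it is undefined behavior.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copies 32 bytes from src to dst.
   Both pointers must be 32-byte aligned. */
void copy32_aligned(const uint8_t *src, uint8_t *dst)
{
    /* Debug-only sanity check; compiled out when NDEBUG is defined. */
    assert(((uintptr_t)src % 32) == 0 && ((uintptr_t)dst % 32) == 0);

    /* Promise GCC that both pointers are 32-byte (256-bit) aligned. */
    const uint8_t *s = __builtin_assume_aligned(src, 32);
    uint8_t *d = __builtin_assume_aligned(dst, 32);

#if defined(__ARM_NEON)
    /* vld4 de-interleaves and vst4 re-interleaves, so this is a plain copy. */
    uint8x8x4_t temp = vld4_u8(s);
    vst4_u8(d, temp);
#else
    /* Portable fallback so the sketch builds on non-NEON targets. */
    memcpy(d, s, 32);
#endif
}

/* Hypothetical helper: returns a 32-byte buffer with 32-byte alignment,
   or NULL on failure. Uses C11 aligned_alloc; release it with free(). */
uint8_t *alloc32(void)
{
    return aligned_alloc(32, 32);
}
```

    The buffers passed in have to really be aligned, e.g. obtained from alloc32() above or declared with __attribute__((aligned(32))). Otherwise the hinted instruction can fault at run time.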