c++ arm sse neon intrinsics

Translating SSE to Neon: How to pack and then extract 32bit result


I have to translate the following instruction from SSE to Neon:

 uint32_t r = _mm_cvtsi128_si32(_mm_shuffle_epi8(a, SHUFFLE_MASK)); // a is an __m128i

Where:

static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3,  7,  11, 15, -1, -1, -1, -1,
                                                  -1, -1, -1, -1, -1, -1, -1, -1);

So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. It looks like a packing operation (in SSE I seem to remember using a shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).

How does this operation translate in Neon?
Should I use packing instructions?
How do I then extract the 32 bits? (Is there anything equivalent to _mm_cvtsi128_si32?)

Edit:
To start with, vgetq_lane_u32 should be able to replace _mm_cvtsi128_si32 (but I will have to cast my uint8x16_t to uint32x4_t):

uint32_t  vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);

or store the lane directly with vst1q_lane_u32:

void  vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
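
A minimal sketch of that extraction, assuming the cast is done with vreinterpretq_u32_u8 (which only changes the vector's type). The names low_u32 and low_u32_from_bytes are hypothetical helpers made up for illustration, and the scalar branch exists only so the snippet also builds on a non-ARM host:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: returns the low 32 bits of a byte vector,
// playing the role of _mm_cvtsi128_si32.
static inline uint32_t low_u32(uint8x16_t v) {
    // Reinterpret the 16 bytes as four 32-bit lanes, then read lane 0.
    return vgetq_lane_u32(vreinterpretq_u32_u8(v), 0);
}

// Hypothetical wrapper: loads 16 bytes from p and returns the first 32 bits.
uint32_t low_u32_from_bytes(const uint8_t* p) {
    assert(p != nullptr);
    return low_u32(vld1q_u8(p));
}
#else
// Scalar fallback (assumption: only here so the sketch builds off-ARM).
uint32_t low_u32_from_bytes(const uint8_t* p) {
    assert(p != nullptr);
    uint32_t r;
    std::memcpy(&r, p, sizeof r);  // first 4 bytes, native byte order
    return r;
}
#endif
```

Reading the lane into a plain uint32_t avoids having to hand vst1q_lane_u32 a uint32_t* when the destination buffer is declared as bytes.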

Solution

  • I found this excellent guide and am working through it. It seems that my operation could be done with a single VTBL instruction (table lookup), but I will first implement it with two de-interleaving operations because for the moment that looks simpler.

    uint8x8x2_t   vuzp_u8(uint8x8_t a, uint8x8_t b);
    

    So something like:

    uint8x16_t a;
    uint8_t* out;
    [...]
    
    //a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0
    
    a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
    //a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0
    
    a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
    //a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0
    
    vst1q_lane_u32(out,a,0);
    

    The last line compiles without a warning when using __attribute__((optimize("lax-vector-conversions"))).

    However, the two assignments do not compile, because vuzp_u8 returns a uint8x8x2_t, which cannot be assigned to a uint8x16_t. One workaround is the following (Edit: this breaks strict aliasing rules! The compiler may assume that a is not modified by the writes through d.):

    uint8x8x2_t* d = (uint8x8x2_t*) &a;
    *d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
    *d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
    vst1q_lane_u32(out,a,0);
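
A possible way to sidestep the pointer cast, sketched here as an assumption rather than as what I actually used: recombine the two halves returned by vuzp_u8 with vcombine_u8, so every assignment keeps the uint8x16_t type. The helper names uzp_halves and pack_every_fourth_uzp are made up for illustration, and the scalar branch is only there so the snippet builds on a non-ARM host:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical helper: one de-interleave pass over the two halves of v.
// The result holds the even-indexed bytes first, then the odd-indexed bytes.
static inline uint8x16_t uzp_halves(uint8x16_t v) {
    uint8x8x2_t t = vuzp_u8(vget_low_u8(v), vget_high_u8(v));
    return vcombine_u8(t.val[0], t.val[1]);
}

// Writes input bytes 0, 4, 8 and 12 to out[0..3].
void pack_every_fourth_uzp(const uint8_t* in, uint8_t* out) {
    assert(in != nullptr && out != nullptr);
    uint8x16_t a = vld1q_u8(in);
    a = uzp_halves(a);  // bytes 0,2,4,...,14 followed by 1,3,...,15
    a = uzp_halves(a);  // bytes 0,4,8,12 now sit in the first four positions
    uint32_t r = vgetq_lane_u32(vreinterpretq_u32_u8(a), 0);
    std::memcpy(out, &r, sizeof r);  // no alignment requirement on out
}
#else
// Scalar fallback (assumption: only here so the sketch builds off-ARM).
void pack_every_fourth_uzp(const uint8_t* in, uint8_t* out) {
    assert(in != nullptr && out != nullptr);
    for (int i = 0; i < 4; ++i) out[i] = in[4 * i];
}
#endif
```

With the input from the comments above (138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0), out receives 138 140 146 147.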
    

    I have implemented a more general workaround through a flexible data type:

    NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
    a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
    a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
    vst1q_lane_u32(out,a,0);
    

    Edit:

    Here is the version with a shuffle mask (lookup table). It does indeed make my inner loop a little bit faster. Again, I have used the data type described here.

    static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
    NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
    NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
    [...]
    res = vtbl2_u8(a, MASK);
    vst1_lane_u32(out,res,0);
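
For readers without the wrapper type, the same table lookup can be sketched with plain intrinsics. This is an assumption-laden illustration, not the code from my loop: pack_every_fourth_tbl and kIndices are hypothetical names, and the scalar branch is only there so the snippet builds on a non-ARM host. The indices 0, 4, 8, 12 match the MASK above; to pick the bytes named by the SSE mask in the question they would be 3, 7, 11, 15.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Writes input bytes 0, 4, 8 and 12 to out[0..3] using one table lookup.
void pack_every_fourth_tbl(const uint8_t* in, uint8_t* out) {
    assert(in != nullptr && out != nullptr);
    // Out-of-range indices (0xff) make the lookup return 0 for that byte.
    static const uint8_t kIndices[8] = {0, 4, 8, 12, 0xff, 0xff, 0xff, 0xff};

    uint8x16_t a = vld1q_u8(in);
    uint8x8x2_t table;            // the 16 source bytes as a two-register table
    table.val[0] = vget_low_u8(a);
    table.val[1] = vget_high_u8(a);

    uint8x8_t res = vtbl2_u8(table, vld1_u8(kIndices));
    uint32_t r = vget_lane_u32(vreinterpret_u32_u8(res), 0);
    std::memcpy(out, &r, sizeof r);  // no alignment requirement on out
}
#else
// Scalar fallback (assumption: only here so the sketch builds off-ARM).
void pack_every_fourth_tbl(const uint8_t* in, uint8_t* out) {
    assert(in != nullptr && out != nullptr);
    for (int i = 0; i < 4; ++i) out[i] = in[4 * i];
}
#endif
```

In a real inner loop the index vector would be loaded once outside the loop, so each iteration costs one lookup plus the store.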