
Why does shift right in practice shift left (and vice versa) in Neon and SSE?


(Note: in Neon I am using this data type to avoid dealing with conversions among 16-bit data types.)

Why does "shift left" in intrinsics in practice "shift right"?

// Values contained in a
// 141 138 145 147 144 140 147 153 154 147 149 146 155 152 147 152
b = vshlq_n_u32(a,8);
// Values contained in b
// 0 141 138 145 0 144 140 147 0 154 147 149 0 155 152 147
b = vshrq_n_u32(a,8);
// Values contained in b
// 138 145 147 0 140 147 153 0 147 149 146 0 152 147 152 0
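
For reference, a self-contained program reproducing the NEON case might look like this (a sketch; the vreinterpretq casts between uint8x16_t and uint32x4_t are my assumption about how the bytes are being viewed, since the exact types aren't shown above):

#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    // The same byte values as above, loaded as 16 unsigned 8-bit lanes.
    const uint8_t data[16] = {141, 138, 145, 147, 144, 140, 147, 153,
                              154, 147, 149, 146, 155, 152, 147, 152};
    uint8x16_t a = vld1q_u8(data);

    // "Shift left" each 32-bit lane by 8 bits, then view the bytes again.
    uint8x16_t b = vreinterpretq_u8_u32(
        vshlq_n_u32(vreinterpretq_u32_u8(a), 8));

    uint8_t out[16];
    vst1q_u8(out, b);
    for (int i = 0; i < 16; ++i) printf("%d ", out[i]);
    printf("\n");  // on little-endian ARM: 0 141 138 145 0 144 140 147 ...
    return 0;
}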

I remember running into the same situation when using _mm_slli_si128 (which is a byte-granular shift, so it behaves differently); its result after a shift looks like this:

// b = _mm_slli_si128(a,1);
// 0 141 138 145 147 144 140 147 153 154 147 149 146 155 152 147

Is it because of endianness? Will it change from platform to platform?


Solution

  • You ask "is it because of endianness?", but it's more a case of type abuse. You're making assumptions about the bit ordering of the machine across byte/word boundaries, and you're using a non-byte instruction that imposes the machine's endianness on the operation (an _u32 instruction expects unsigned 32-bit values, not arrays of 8-bit values).

    As you say, you are asking it to shift a series of unsigned char values, but you do so by telling it to shift values in 32-bit units.

    Unfortunately, you are going to need to put them in architecture order if you want to be able to do an architecture shift on them.

    Otherwise you may want to look for a blit or move instruction (see the sketch below), but you can't artificially coerce machine types into machine registers without paying architectural costs. Endianness will be just one of your headaches (alignment, padding, etc.).
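
    If what you actually want is a byte-wise move, both instruction sets have byte-granular operations that treat the vector as an array of 16 bytes, so the result matches the array view regardless of lane endianness. A sketch (the helper names are mine):

    #include <arm_neon.h>

    // Move every byte one position toward higher indices, filling with zero.
    // In effect this matches _mm_slli_si128(a, 1) on SSE.
    uint8x16_t shift_bytes_up_1(uint8x16_t a) {
        return vextq_u8(vdupq_n_u8(0), a, 15);
    }

    // Move every byte one position toward lower indices, filling with zero.
    // In effect this matches _mm_srli_si128(a, 1) on SSE.
    uint8x16_t shift_bytes_down_1(uint8x16_t a) {
        return vextq_u8(a, vdupq_n_u8(0), 1);
    }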

    --- Late Edit ---

    Fundamentally, you are confusing byte shifts and bit shifts; we consider the most significant bits to be "left":

    bit number
    87654321
    
    hex digit weights (each hex digit covers four bits, worth 8 4 2 1)
        8421
    00008421
    
    00000001  = 0x01 (small, less significant)
    10000000  = 0x80 (large, more significant)
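
    The same point as a minimal sketch in code:

    #include <cstdint>
    #include <cassert>

    int main() {
        uint8_t x = 0x01;       // 00000001: least significant bit set
        uint8_t y = x << 7;     // 10000000 = 0x80: most significant bit set
        assert(y == 0x80);      // shifting "left" moves toward significance
        return 0;
    }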
    

    But the values you are shifting are 32-bit words. On a little-endian machine, each subsequent address holds a more significant byte of the value. For a 32-bit word:

    bit numbers, in memory (address) order:
    byte 0: bits  8..1    (least significant byte, lowest address)
    byte 1: bits 16..9
    byte 2: bits 24..17
    byte 3: bits 32..25   (most significant byte, highest address)
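
    You can see this layout directly by dumping the bytes of a uint32_t in address order (a sketch; the printed order assumes a little-endian machine):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        uint32_t v = 0x00000001;
        unsigned char bytes[4];
        std::memcpy(bytes, &v, sizeof v);   // copy out the in-memory byte order
        for (int i = 0; i < 4; ++i) printf("%02x ", bytes[i]);
        printf("\n");                       // little-endian prints: 01 00 00 00
        return 0;
    }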
    

    To represent the 32-bit value 0x0001 (bytes in address order, bits within each byte shown most significant first):

     byte 0   byte 1   byte 2   byte 3
    00000001 00000000 00000000 00000000
    

    To shift it left by 2 positions

    00000001 00000000 00000000 00000000
         v<<
    00000100 00000000 00000000 00000000
    

    To shift it left by another 8 positions, the set bit has to carry into the next byte, at the next (higher) address:

    00000100 00000000 00000000 00000000
         >>>>>>>>>v
    00000000 00000100 00000000 00000000
    

    This looks like a right shift if you are thinking in bytes. But we told this little-endian CPU that we were working on a uint32, so that means:

     byte 0   byte 1   byte 2   byte 3
    00000001 00000000 00000000 00000000  = 0x0001
    00000100 00000000 00000000 00000000  = 0x0004
    00000000 00000100 00000000 00000000  = 0x0400
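
    The same progression in code (a sketch; the byte dump again assumes a little-endian machine):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    static void dump(uint32_t v) {
        unsigned char b[4];
        std::memcpy(b, &v, sizeof v);       // bytes in memory-address order
        printf("%02x %02x %02x %02x  = 0x%04x\n", b[0], b[1], b[2], b[3], v);
    }

    int main() {
        uint32_t v = 0x0001;
        dump(v);     // 01 00 00 00  = 0x0001
        v <<= 2;
        dump(v);     // 04 00 00 00  = 0x0004
        v <<= 8;
        dump(v);     // 00 04 00 00  = 0x0400  (the set byte moved "right")
        return 0;
    }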
    

    The problem is that this is a different order from the byte ordering you expect for a local array of 8-bit values, but you told the CPU the values were _u32, so it used its native endianness for the operation.