(Note: in NEON I am using this data type to avoid dealing with conversions among 16-bit data types.)
Why does "shift left" in intrinsics in practice "shift right"?
// Values contained in a
// 141 138 145 147 144 140 147 153 154 147 149 146 155 152 147 152
b = vshlq_n_u32(a,8);
// Values contained in b
// 0 141 138 145 0 144 140 147 0 154 147 149 0 155 152 147
b = vshrq_n_u32(a,8);
// Values contained in b
// 138 145 147 0 140 147 153 0 147 149 146 0 152 147 152 0
I remember running into the same situation when using _mm_slli_si128
(which is different, though; the result after a shift looks like this):
// b = _mm_slli_si128(a,1);
// 0 141 138 145 147 144 140 147 153 154 147 149 146 155 152 147
Is it because of endianness? Will it change from platform to platform?
You ask "is it because of endianness?", but it's really a case of type abuse. You're making assumptions about the byte ordering of the machine across byte/word boundaries, and the non-byte instructions impose the machine's endianness on the operation: you're using a _u32 instruction, which expects unsigned 32-bit values, not arrays of 8-bit values.
In other words, you are asking it to shift a series of unsigned char values by telling it to shift them in 32-bit units.
Unfortunately, you are going to need to put them in architecture order if you want to be able to do an architecture shift on them.
Otherwise you may want to look for a blit or move instruction, but you can't artificially coerce machine types into machine registers without paying architectural costs. Endianness will be just one of your headaches (alignment, padding, etc.).
--- Late Edit ---
Fundamentally, you are confusing byte shifts with bit shifts. We consider the most significant bits to be "left":
bit number
87654321
hex weight
84218421

00000001 = 0x01 (small, less significant)
10000000 = 0x80 (large, more significant)
But the values you are shifting are 32-bit words. On a little-endian machine, each subsequent address holds a more significant byte of the value. For a 32-bit word:
bit numbers (hex)
        1       1111111121111111
876543210fedcba9876543210fedcba9
To represent the 32-bit value 0x0001:
        1       1111111121111111
876543210fedcba9876543210fedcba9
00000001000000000000000000000000
To shift it left by 2 positions:
00000001000000000000000000000000
     v<<
00000100000000000000000000000000
To shift it left by another 8 positions, we have to carry it into the next address:
00000100000000000000000000000000
      >>>>>>>v
00000000000001000000000000000000
This looks like a right shift if you are thinking in bytes. But we told this little-endian CPU that we were working on a uint32, so that means:
        1       1111111121111111
876543210fedcba9876543210fedcba9
byte 0  byte 1  byte 2  byte 3
00000001000000000000000000000000 = 0x0001
00000100000000000000000000000000 = 0x0004
00000000000001000000000000000000 = 0x0400
The problem is that this is a different order than the one you expect for a local array of 8-bit values, but you told the CPU the values were _u32, so it used its native endianness for the operation.