I'm working with integers and SSE and have become very confused about how endianness affects moving data in and out of registers.
Initially my understanding was as follows. If I have an array of 4-byte integers, the memory would be laid out as follows, since x86 architectures are little-endian:
0D 0C 0B 0A 1D 1C 1B 1A 2D 2C 2B 2A .... nD nC nB nA
Where the letters A, B, C and D index the bytes within an integer element, and numbers index the element.
In an XMM register, my understanding is that four integers would be laid out as follows:
0A 0B 0C 0D 1A 1B 1C 1D 2A 2B 2C 2D 3A 3B 3C 3D
However, I'm pretty sure this picture is wrong for several reasons. The first is the documentation for the _mm_load_si128 intrinsic, which is supposed to work for any integer data, but in the above picture would only work for one word size. Similarly there is this (archived) thread. Finally I see people writing code like the following:
__declspec(align(16)) int32_t A[N];
__m128i* As = (__m128i*)A;
The Wikipedia article on endianness says I should think of memory addresses increasing right to left. How about the following picture for memory then?
nA nB nC nD ... 2A 2B 2C 2D 1A 1B 1C 1D 0A 0B 0C 0D
And then in a register:
3A 3B 3C 3D 2A 2B 2C 2D 1A 1B 1C 1D 0A 0B 0C 0D
It's just a question of interpretation. We read/write the digits of a number from left to right, highest digit to lowest digit. So for a 32-bit number with the highest byte A, then B, then C, and lowest byte D, we would read/write ABCD. We do the same when notating a 128-bit integer:
3A3B3C3D 2A2B2C2D 1A1B1C1D 0A0B0C0D
But a little-endian system reads and writes the bytes from the lowest address to the highest, like this:
0D0C0B0A 1D1C1B1A 2D2C2B2A 3D3C3B3A
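A minimal sketch of how to check this byte order yourself, assuming a little-endian host such as x86. The helper name dump_bytes and the element values (chosen so that element 0 is literally 0x0A0B0C0D, and so on) are illustrative and not from the original post.

```c
#include <stdint.h>
#include <string.h>

/* Copy four 32-bit integers into a 16-byte buffer exactly as they sit
 * in memory, so each byte can be inspected by address offset.
 * (Hypothetical helper, for illustration only.) */
static void dump_bytes(const uint32_t v[4], uint8_t out[16]) {
    memcpy(out, v, 16);
}
```

Calling dump_bytes on {0x0A0B0C0D, 0x1A1B1C1D, 0x2A2B2C2D, 0x3A3B3C3D} on a little-endian machine should give out[0] == 0x0D and out[3] == 0x0A: the lowest byte of each element sits at the lowest address, matching the line above.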
For 16-bit integers it's the same logic. We could read/write it as
7A7B 6A6B 5A5B 4A4B 3A3B 2A2B 1A1B 0A0B
and the little-endian computer reads/stores it from lowest to highest address as
0B0A 1B1A 2B2A 3B3A 4B4A 5B5A 6B6A 7B7A
That's why only one kind of instruction is needed to read/write 32-bit, 16-bit and 8-bit integers into a 128-bit register, namely movdqa or movaps (or the unaligned variants movdqu and movups).
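As a rough sketch of that point, the same SSE2 load can be followed by reads of the lowest 32-bit, 16-bit or 8-bit lane. This assumes an x86 target with SSE2; the helper names load16, low32, low16 and low8 are made up for illustration, while the intrinsics themselves come from emmintrin.h.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* One unaligned 16-byte load (movdqu); the element width plays no role. */
static __m128i load16(const void *p) {
    return _mm_loadu_si128((const __m128i *)p);
}

/* Lowest 32-bit lane of the register. */
static uint32_t low32(__m128i v) {
    return (uint32_t)_mm_cvtsi128_si32(v);
}

/* Lowest 16-bit lane of the register. */
static uint16_t low16(__m128i v) {
    return (uint16_t)_mm_extract_epi16(v, 0);
}

/* Lowest byte of the register. */
static uint8_t low8(__m128i v) {
    return (uint8_t)(_mm_cvtsi128_si32(v) & 0xFF);
}
```

Loading an int32 array whose element 0 is 0x0A0B0C0D, low32 should return 0x0A0B0C0D, low16 should return 0x0C0D and low8 should return 0x0D. The load only copies 16 bytes; how they are split into lanes is decided by the instructions that operate on the register afterwards.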