What is the most efficient way to do a 128-bit shift on a modern Intel CPU (Core i7, Sandy Bridge)?
Code similar to this is in my innermost loop:
u128 a[N];
void xor() {
    for (int i = 0; i < N; ++i) {
        a[i] = a[i] ^ (a[i] >> 1) ^ (a[i] >> 2);
    }
}
The data in a[N] is almost random.
Use the double-precision shift instructions, SHLD or SHRD, because SSE isn't intended for this purpose.
This is the classic method. Below are test cases for a 128-bit left shift by 16 bits in 32-bit and 64-bit CPU modes.
This way you can shift a value of any size by up to 31 bits (with 32-bit registers) or 63 bits (with 64-bit registers), since the CPU masks the shift count to 5 or 6 bits. You can shift by an immediate count or by a count held in the CL register. The first (destination) operand can also be a variable in memory.
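The chaining idea can also be sketched in C. This is only an illustration: the function name shl_multiword and the little-endian word order (w[0] is the least significant word) are assumptions made for this sketch, not part of the method as described. Each word keeps its own bits shifted up and takes its low bits from the next lower word, which is what SHLD does in a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: shift an n-word integer left by `count` bits.
   w[0] is the least significant word. The caller must keep
   count in the range 0..63; larger counts are not handled here. */
void shl_multiword(uint64_t *w, size_t n, unsigned count)
{
    if (n == 0 || count == 0)
        return;                 /* also avoids an undefined shift by 64 below */

    /* Work from the most significant word down, so each lower word
       is still unshifted when its top bits are read. */
    for (size_t i = n - 1; i > 0; --i) {
        /* Roughly: shld w[i], w[i-1], count */
        w[i] = (w[i] << count) | (w[i - 1] >> (64 - count));
    }
    w[0] <<= count;             /* Roughly: shl w[0], count */
}
```

The loop runs from the top word downward for the same reason the SHLD instructions are issued from the highest register to the lowest: the source word must not have been shifted yet.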
128-bit left shift by 16 bits in 32-bit x86 mode:
mov eax, $04030201;
mov ebx, $08070605;
mov ecx, $0C0B0A09;
mov edx, $100F0E0D;
shld edx, ecx, 16
shld ecx, ebx, 16
shld ebx, eax, 16
shl eax, 16
And a 128-bit left shift by 16 bits in 64-bit x86 mode:
mov rax, $0807060504030201;
mov rdx, $100F0E0D0C0B0A09;
shld rdx, rax, 16
shl rax, 16
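Applied to the loop in the question, the same idea with right shifts might look like the C sketch below. The two-half struct layout for u128 and the names shr128 and xor_shifts are assumptions for illustration only. A compiler targeting x86-64 may turn the two-line shift into SHRD plus SHR, but that is worth confirming in the generated assembly rather than taking for granted.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed representation: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* Logical right shift by `count` bits, valid only for 1 <= count <= 63. */
u128 shr128(u128 x, unsigned count)
{
    u128 r;
    r.lo = (x.lo >> count) | (x.hi << (64 - count)); /* like shrd lo, hi, count */
    r.hi = x.hi >> count;                            /* like shr hi, count */
    return r;
}

/* Computes a[i] = a[i] ^ (a[i] >> 1) ^ (a[i] >> 2) for every element. */
void xor_shifts(u128 *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        u128 s1 = shr128(a[i], 1);
        u128 s2 = shr128(a[i], 2);
        a[i].lo ^= s1.lo ^ s2.lo;
        a[i].hi ^= s1.hi ^ s2.hi;
    }
}
```

Both shifted copies are computed from the original element before it is overwritten, which matches the expression in the question.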