Search code examples
c++simdintrinsicsavxavx2

left shift of 128 bit number using AVX2 instruction


I am trying to do left rotation of a 128 bit number in AVX2. Since there is no direct method of doing this, I have tried using left shift and right shift to accomplish my task.

Here is a snippet of my code to do the same.

        l = 4;
        r = 4;
        targetrotate = _mm_set_epi64x (l, r);
        targetleftrotate = _mm_sllv_epi64 (target, targetrotate);

The above c ode snippet rotates target by 4 to the left.
When I tested the above code with a sample input, I could see the result is not rotated correctly.

Here is the sample input and output

          input: 01 23 45 67 89 ab cd ef   fe dc ba 98 76 54 32 10
obtained output: 10 30 52 74 96 b8 da fc   e0 cf ad 8b 69 47 25 03

But, the output I expect is

                 12 34 56 78 9a bc de f0   ed cb a9 87 65 43 21 00

I know that I am doing something wrong. I want to know whether my expected output is right and if so, I want to know what am I doing wrong here.

Any kind of help would be greatly appreciated and thanks in advance.


Solution

  • I think you have an endian issue with how you're printing your input and output.

    The left-most bytes within each 64-bit half are the least-significant bytes in your actual output, so 0xfe << 4 becomes 0xe0, with the f shifting into a higher byte.

    See Convention for displaying vector registers for more discussion of that.

    Your "expected" output matches what you'd get if you were printing values high element first (highest address when stored). But that's not what you're doing; you're printing each byte separately in ascending memory order. x86 is little-endian. This conflicts with the numeral system we use in English, where we read Arabic numerals from left to right, highest place-value on the left, effectively human big-endian. Fun fact: The Arabic language reads from right to left so for them, written numbers are "human little-endian".

    (And across elements, higher elements are at higher addresses; printing high elements first makes whole-vector shifts like _mm_bslli_si128 aka pslldq make sense in the way it shifts bytes left between elements.)

    If you're using a debugger, you're probably printing within that. If you're using debug-prints, see print a __m128i variable.


    BTW, you can use _mm_set1_epi64x(4) to put the same value in both elements of a vector, instead of using separate l and r variables with the same value.

    In _mm_set intrinsics, the high elements come first, matching the diagrams in Intel's asm manuals, and matching the semantic meaning of "left" shift moving bits/bytes to the left. (e.g. see Intel's diagrams an element-numbering for pshufd, _mm_shuffle_epi32)


    BTW, AVX512 has vprolvq rotates. But yes, to emulate rotates you want a SIMD version of (x << n) | x >> (64-n). Note that x86 SIMD shifts saturate the shift count, unlike scalar shifts which mask the count. So x >> 64 will shift out all the bits. If you want to support rotate counts above 63, you probably need to mask.

    (Best practices for circular shift (rotate) operations in C++ but you're using intrinsics so you don't have to worry about C shift-count UB, just the actual known hardware behaviour.)