Search code examples
memorysimdsseintrinsicsicc

SSE divrem memory store requirements


I'm searching for information on the divrem intrinsic sequences and their memory requirements (for the store).

These folks (check SSE and SVML to see the intel intrinsics doc) :

__m128i _mm_idivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_idivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
__m128i _mm_udivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_udivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)

On the intel intrinsics guide, it states.

Divide packed 32-bit integers in a by packed elements in b, store the truncated results in dst, and store the remainders as packed 32-bit integers into memory at mem_addr.

FOR j := 0 to 3
    i := 32*j
    dst[i+31:i] := TRUNCATE(a[i+31:i] / b[i+31:i])
    MEM[mem_addr+i+31:mem_addr+i] := REMAINDER(a[i+31:i] / b[i+31:i])
ENDFOR
dst[MAX:128] := 0

Does this mean, mem_addr is expected to be aligned (as per store), unaligned (storeu), or is it supposed to be a single register output (__m128i on the stack)?


Solution

  • alignof(__m256i) == 32, so for portability to any other compilers that might implement this intrinsic (like clang-based ICX), you should point it at aligned memory, or just a __m128i / __m256i temporary and use a normal store intrinsic (store or storeu) to tell the compiler where you want it to go.

    As Homer512 points out with an example in https://godbolt.org/z/9szzjEo7c , ICC stores it with movdqu. But we can see it always uses unaligned loads/stores, also for deref of __m128i* pointers for inputs. GCC and clang do use alignment-required loads/stores when you promise them alignment (e.g. by deref of a __m128i*).

    The actual SVML function call QWORD PTR [__svml_idivrem4@GOTPCREL+rip] returns in XMM0 and XMM1; the by-reference output operand is fortunately an invention of the intrinsics API. So it will fully optimize away to pass the address of __m128i tmp and then store it somewhere.