I'm looking for information on the `divrem` intrinsics and their memory requirements (for the store). These ones (filter by SSE and SVML in the Intel intrinsics guide):
```c
__m128i _mm_idivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_idivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
__m128i _mm_udivrem_epi32 (__m128i * mem_addr, __m128i a, __m128i b)
__m256i _mm256_udivrem_epi32 (__m256i * mem_addr, __m256i a, __m256i b)
```
The Intel intrinsics guide says:

> Divide packed 32-bit integers in a by packed elements in b, store the truncated results in dst, and store the remainders as packed 32-bit integers into memory at mem_addr.
```
FOR j := 0 to 3
    i := 32*j
    dst[i+31:i] := TRUNCATE(a[i+31:i] / b[i+31:i])
    MEM[mem_addr+i+31:mem_addr+i] := REMAINDER(a[i+31:i] / b[i+31:i])
ENDFOR
dst[MAX:128] := 0
```
Does this mean `mem_addr` is expected to be aligned (as with `store`), unaligned (as with `storeu`), or is it supposed to be a single register output (a `__m128i` on the stack)?
`alignof(__m256i) == 32`, so for portability to any other compilers that might implement this intrinsic (like clang-based ICX), you should point it at aligned memory, or at a `__m128i` / `__m256i` temporary, and then use a normal store intrinsic (`store` or `storeu`) to tell the compiler where you want the result to go.
As Homer512 points out with an example (https://godbolt.org/z/9szzjEo7c), ICC stores the remainders with `movdqu`. But we can see that ICC always uses unaligned loads/stores, even for dereferences of `__m128i*` pointers for the inputs. GCC and clang do use alignment-required loads/stores when you promise them alignment (e.g. by dereferencing a `__m128i*`).
The actual SVML function (`call QWORD PTR [__svml_idivrem4@GOTPCREL+rip]`) returns in XMM0 and XMM1; the by-reference output operand is fortunately just an invention of the intrinsics API. So passing the address of a `__m128i tmp` and then storing it wherever you want will fully optimize away.