Currently, I've got a __m128i variable, let's call it X. I want to xor it with a constant 128-bit value and save the result back to X. So, essentially X ^= C for some constant C.
Currently, I'm doing something along the lines of:
X = _mm_xor_si128(X, _mm_set_epi64x(C_a, C_b))
Which builds a __m128i from the two 64-bit parts of C for the xor.
My question: this doesn't seem like the most efficient way to initialize the __m128i constant for the xor. Would it be better to do a load from an aligned array? Or use some other method?
I'm currently working with MSVC in Visual Studio.
This answer is purely about the case of constant C. If you have non-constant inputs, it matters where they're coming from (memory, registers, a recent computation that you could maybe do in vector registers in the first place?) and potentially what you're going to do with the resulting vector. Shuffling separate scalar variables into / out of SIMD vectors kinda sucks, with a tradeoff between ALU port bottlenecks vs. latency and throughput of store/reload (and the store-forwarding stall for scalar -> vector). Store/reload is good in asm for getting lots of small elements out of a SIMD vector when you do want them all, though.
For constant C_a and C_b, even MSVC does a good job at constant-propagation through that _mm_set. So there's no advantage to writing an implementation-specific initializer like the one in "SSE Error - Using m128i_i32 to define fields of a __m128i variable".
Remember that the real determiner of performance is the assembly you can coax the compiler into producing, not really which intrinsics you use to do that.
#include <immintrin.h>
__m128i xor_const(__m128i v) {
    return _mm_xor_si128(v, _mm_set_epi64x(0x789abc, 0x123456));
}
Compiled (on Godbolt) with x64 MSVC -O2 -Gv (to use vectorcall so we can see what it does when a vector is already in a register, like when this inlines), we get this fairly stupid asm, which hopefully wouldn't be this bad in a larger function after inlining:
;; MSVC 19.10
;; this is in the .rdata section; godbolt just filters directives that aren't interesting
;; "everyone knows" that compilers put data in the right sections
__xmm@0000000000789abc0000000000123456 DB 'V4', 012H, 00H, 00H, 00H, 00H, 00H
DB 0bcH, 09aH, 'x', 00H, 00H, 00H, 00H, 00H
xor_const@@16 PROC ; COMDAT
movdqa xmm1, XMMWORD PTR __xmm@0000000000789abc0000000000123456
pxor xmm1, xmm0
movdqa xmm0, xmm1
ret 0
xor_const@@16 ENDP
We can see that the _mm_set intrinsic compiled to a 16-byte constant in static storage, like we want. Failure to use pxor xmm0, xmm1 is surprising, but MSVC is well known for asm that's often not quite as good as what GCC and/or clang produce. Again, as part of a large function where it has a choice of registers, we'd probably have no extra movdqa. And if the xor was in a loop, loading once outside the loop is what we want anyway. This wasn't the most recent MSVC version; Godbolt only has the most up-to-date MSVC versions installed for C++, not C, but you tagged this C.
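To make the loop case concrete, here is a minimal sketch. The function name xor_buffer and the buffer layout (a whole number of 16-byte blocks, not necessarily aligned) are assumptions made for illustration, not something from the question. The constant is written once before the loop, and an optimizing compiler would typically keep it in a register for the whole loop instead of rebuilding or reloading it on every iteration.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: XOR each 16-byte block of buf with the same constant. */
void xor_buffer(uint8_t *buf, size_t nblocks) {
    /* First argument is the high 64 bits, second is the low 64 bits. */
    const __m128i c = _mm_set_epi64x(0x789abc, 0x123456);

    for (size_t i = 0; i < nblocks; i++) {
        /* Unaligned load/store so the sketch works for any buffer address. */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + 16 * i));
        v = _mm_xor_si128(v, c);
        _mm_storeu_si128((__m128i *)(buf + 16 * i), v);
    }
}
```

It's worth checking the generated asm for your real loop to confirm that the constant load really ends up outside it.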
By comparison, GCC 9.2 -O3 compiles xor_const to the expected memory-source PXOR that's efficient on all CPUs.
xor_const:
pxor xmm0, XMMWORD PTR .LC0[rip]
ret
.section .rodata # Godbolt strips out stuff like section directive; re-added manually
.LC0:
.quad 1193046
.quad 7903932
You could probably get a compiler to emit the same asm with a static alignas(16) array holding the constant, and _mm_load_si128() from that. But why bother?
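For reference, the array-based alternative might look roughly like this in C11. The names xor_const_data and xor_const_load are made up for this sketch, and I haven't compared the asm MSVC generates for it, so treat it as an illustration of the idea rather than a recommendation.

```c
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical name. Element 0 is the low 64 bits, element 1 the high 64 bits,
   matching _mm_set_epi64x(0x789abc, 0x123456). */
static alignas(16) const uint64_t xor_const_data[2] = { 0x123456, 0x789abc };

__m128i xor_const_load(__m128i v) {
    /* _mm_load_si128 requires a 16-byte-aligned address, hence alignas(16). */
    return _mm_xor_si128(v, _mm_load_si128((const __m128i *)xor_const_data));
}
```

Note the element order: the array is laid out in memory order (low half first), while _mm_set_epi64x takes the high half first, so this version is easier to get backwards.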
One thing to avoid is writing static const __m128i C = _mm_set...; compilers are super dumb with this and will not fold the _mm_set into a static constant initializer for the __m128i. C compilers will refuse to compile a non-constant static initializer. C++ compilers will reserve some BSS space and run a constructor-like function to copy from a read-only constant into that BSS space, because _mm_set doesn't fully optimize away in that case.
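If you want the constant to have a name, one way to sidestep that problem is to keep the _mm_set inside a function, so it's an ordinary expression that can constant-propagate after inlining. This is only a sketch: the names xor_constant and xor_with_c are hypothetical.

```c
#include <immintrin.h>

/* Avoid at file scope:
 *   static const __m128i C = _mm_set_epi64x(0x789abc, 0x123456);
 * In C the initializer is not a constant expression, so it is rejected. */

/* Hypothetical helper that gives the constant a name without static storage. */
static inline __m128i xor_constant(void) {
    return _mm_set_epi64x(0x789abc, 0x123456);
}

__m128i xor_with_c(__m128i x) {
    /* Automatic (non-static) local, initialized at the point of use. */
    const __m128i c = xor_constant();
    return _mm_xor_si128(x, c);
}
```

After inlining, this should be the same thing to the compiler as writing the _mm_set_epi64x call directly inside the xor, as in xor_const above.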