Which is more efficient, and why: _mm_loadu_si128 or _mm_load_si128 in C?
(Editor's note: since this was tagged assembly, the asker possibly meant movdqu vs. movdqa in hand-written asm. That is not the same question, especially without AVX, because _mm_load_si128 can compile into a memory operand for an ALU instruction with no separate movdqa at all. Legacy SSE memory operands require 16-byte alignment, so only the aligned intrinsic can be folded that way.)
loadu is used for misaligned loads (from addresses that are not a multiple of 16 bytes) and load is used for aligned loads. If you know that your source address is correctly aligned, then load is normally at least as efficient: the hardware can fetch all 16 bytes in a single access and never has to stitch together data that straddles an alignment boundary. On older Intel CPUs the performance penalty for misaligned loads was quite significant (typically > 2x), but on more recent CPUs (e.g. Core i5/i7) the penalty is almost negligible unless the load crosses a cache-line boundary. Note that using loadu on aligned data is always safe and costs at most the aforementioned performance penalty, whereas using load on misaligned data will typically crash, because movdqa raises an exception when its address is not 16-byte aligned.
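
As a rough illustration of the difference, here is a minimal sketch (not a benchmark) that loads four 32-bit integers with each intrinsic. The helper names sum_lanes, sum4_aligned and sum4_unaligned are made up for this example, and the assert is only there to make the alignment requirement of _mm_load_si128 explicit in debug builds.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_loadu_si128 */
#include <stdint.h>

/* Hypothetical helper: add up the four 32-bit lanes of a vector. */
static int32_t sum_lanes(__m128i v) {
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);  /* unaligned store is always safe */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Aligned load: p MUST be a multiple of 16, otherwise this may fault. */
static int32_t sum4_aligned(const int32_t *p) {
    assert(((uintptr_t)p & 15) == 0);
    return sum_lanes(_mm_load_si128((const __m128i *)p));
}

/* Unaligned load: works for any address, aligned or not. */
static int32_t sum4_unaligned(const void *p) {
    return sum_lanes(_mm_loadu_si128((const __m128i *)p));
}
```

With a buffer declared as _Alignas(16) int32_t buf[8], both functions can be called on buf and buf + 4, but only sum4_unaligned may be called on buf + 1, since that address is 4 bytes past a 16-byte boundary.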