What is the difference between loadu and load?

Which is more efficient, and why?

Specifically _mm_loadu_si128 vs. _mm_load_si128 in C.

(Editor's note: or, since this was tagged assembly, they possibly meant movdqu vs. movdqa in hand-written asm. That is not the same thing, especially without AVX, because _mm_load_si128 can compile into a memory operand for an ALU instruction with no separate movdqa at all.)

Solution

loadu is for loads from addresses that may be misaligned (not a multiple of 16 bytes), and load is for addresses known to be 16-byte aligned. If you know that your source address is correctly aligned then load would normally be the more efficient choice: an aligned 16-byte load never straddles a cache line, so the CPU can fetch it in one access and never has to stitch the result together from two chunks. On older Intel CPUs the performance penalty for misaligned loads was quite significant (typically > 2x), but on more recent CPUs (e.g. Core i5/i7) the penalty is almost negligible and mainly shows up when a load actually crosses a cache-line boundary. Note that using loadu on aligned data is always safe, and costs at most the performance penalty described above, but using load on misaligned data will typically raise an exception (i.e. crash).
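To make the difference concrete, here is a minimal sketch in C using the SSE2 intrinsics from the question. The helper names (is_aligned16, copy16_aligned, copy16_unaligned, demo) are made up for illustration, and it assumes an x86 target with SSE2 and a C11 compiler for _Alignas.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_loadu_si128, _mm_storeu_si128 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p is a multiple of 16 bytes. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Copies 16 bytes using the aligned load. src MUST be 16-byte aligned;
   the assert documents that precondition in debug builds. */
static void copy16_aligned(uint8_t *dst, const uint8_t *src) {
    assert(is_aligned16(src));
    __m128i v = _mm_load_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Copies 16 bytes using the unaligned load. Any src address is fine. */
static void copy16_unaligned(uint8_t *dst, const uint8_t *src) {
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Returns 1 if both copies reproduce the source bytes, 0 otherwise. */
static int demo(void) {
    _Alignas(16) uint8_t buf[32];
    uint8_t out[16];

    for (int i = 0; i < 32; i++) {
        buf[i] = (uint8_t)i;
    }

    /* buf is 16-byte aligned, so the aligned load is legal here. */
    copy16_aligned(out, buf);
    if (memcmp(out, buf, 16) != 0) {
        return 0;
    }

    /* buf + 1 is misaligned: only the unaligned load may be used. */
    copy16_unaligned(out, buf + 1);
    if (memcmp(out, buf + 1, 16) != 0) {
        return 0;
    }

    return 1;
}
```

Passing buf + 1 to copy16_aligned instead would violate the alignment requirement; in a debug build the assert fires first, and otherwise the load would usually fault.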