Tags: x86, simd, intrinsics, avx, micro-optimization

What's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256?


I had been using _mm256_lddqu_si256 based on an example I found online. Later I discovered _mm256_loadu_si256. The Intel Intrinsics Guide only says that the lddqu version may perform better when the load crosses a cache-line boundary. What might be the advantages of loadu? In general, how are these functions different?


Solution

  • There's no reason to ever use _mm256_lddqu_si256; consider it a synonym for _mm256_loadu_si256. lddqu only exists for historical reasons: x86 evolved towards having better unaligned vector load support, and CPUs that support the AVX version run the two instructions identically. There's no AVX512 version.

    Compilers do still respect the lddqu intrinsic and emit that instruction, so you could use it if you want your code to run identically but have a different checksum or machine code bytes.
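
    For example, here's a minimal sketch (the helper names are made up) showing that the two intrinsics have the same signature and are drop-in replacements for each other; only the opcode the compiler emits differs:

      #include <immintrin.h>

      // Both intrinsics take a __m256i const* and return the 32 bytes at that
      // (possibly unaligned) address; only the machine code differs.
      __m256i load_lddqu(const void *p) {
          return _mm256_lddqu_si256((const __m256i *)p);   // emits vlddqu
      }

      __m256i load_loadu(const void *p) {
          return _mm256_loadu_si256((const __m256i *)p);   // emits vmovdqu
      }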


    No x86 microarchitectures run vlddqu any differently from vmovdqu. I.e. the two opcodes probably decode to the same internal uop on all AVX CPUs. They probably always will, unless some very-low-power or specialized microarchitecture comes along without efficient unaligned vector loads (which have been a thing since Nehalem). Compilers never use vlddqu when auto-vectorizing.

    lddqu was different from movdqu on Pentium 4. See History of … one CPU instructions: Part 1. LDDQU/movdqu explained.

    lddqu is allowed to (and on P4 does) do two aligned 16B loads and take a window of that data. movdqu architecturally only ever loads from the expected 16 bytes. This has implications for store-forwarding: if you're loading data that was just stored with an unaligned store, use movdqu, because store-forwarding only works for loads that are fully contained within a previous store. Otherwise you generally always wanted to use lddqu. (This is why they didn't just make movdqu always use "the good way", and instead introduced a new instruction for programmers to worry about. Luckily for us, they changed the design so we don't have to worry about which unaligned load instruction to use anymore.)
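
    As a rough, purely illustrative model of what that P4-era behaviour means (not how you'd write real code; the helper name is made up):

      #include <immintrin.h>
      #include <stdint.h>
      #include <string.h>

      // Model of P4-era lddqu for a possibly-unaligned 16-byte load at p:
      // read the two aligned 16-byte chunks that cover p, then take the
      // 16-byte window starting at p.  A load done this way can touch bytes
      // outside [p, p+16), which is why it can't be forwarded from an earlier
      // unaligned store of exactly those 16 bytes.
      static __m128i lddqu_model(const uint8_t *p) {
          uintptr_t addr = (uintptr_t)p;
          const uint8_t *aligned = (const uint8_t *)(addr & ~(uintptr_t)15);
          uint8_t window[32];
          memcpy(window, aligned, 32);                 // stands in for two aligned 16B loads
          __m128i result;
          memcpy(&result, window + (addr & 15), 16);   // select the window containing p
          return result;
      }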

    It also has implications for correctness of observable behaviour on UnCacheable (UC) or Uncacheable Speculative Write-Combining (USWC, aka WC) memory types (which might have MMIO registers behind them).


    There's no code-size difference in the two asm instructions:

      # SSE packed-single instructions are shorter than SSE2 integer / packed-double
      4000e3:       0f 10 07                movups xmm0, [rdi]   
    
      4000e6:       f2 0f f0 07             lddqu  xmm0, [rdi]
      4000ea:       f3 0f 6f 07             movdqu xmm0, [rdi]
    
      4000ee:       c5 fb f0 07             vlddqu xmm0, [rdi]
      4000f2:       c5 fa 6f 07             vmovdqu xmm0, [rdi]
      # AVX-256 is the same as AVX-128, but with one more bit set in the VEX prefix
    

    On Core2 and later, there's no reason to use lddqu, but also no downside vs. movdqu. Intel dropped the special lddqu stuff for Core2, so both options suck equally.

    On Core2 specifically, avoiding cache-line splits in software with two aligned loads and SSSE3 palignr is sometimes a win vs. movdqu, especially on 2nd-gen Core2 (Penryn) where palignr is only one shuffle uop instead of 2 on Merom/Conroe. (Penryn widened the shuffle execution unit to 128b).
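
    A sketch of that technique, assuming the misalignment is known at compile time (palignr takes an immediate byte count, which is the main awkwardness of this approach; the function name is made up):

      #include <tmmintrin.h>   // SSSE3: _mm_alignr_epi8

      // Emulate one unaligned 16-byte load at (aligned_base + 5 bytes) with
      // two aligned loads plus palignr.  The shift count must be a
      // compile-time constant.
      static __m128i load_at_offset5(const __m128i *aligned_base) {
          __m128i lo = _mm_load_si128(aligned_base);       // bytes [0..15]
          __m128i hi = _mm_load_si128(aligned_base + 1);   // bytes [16..31]
          return _mm_alignr_epi8(hi, lo, 5);               // bytes [5..20]
      }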

    See Dark Shikari's 2009 Diary Of An x264 Developer blog post, Cacheline splits, take two, for more about unaligned-load strategies back in the bad old days.

    The generation after Core2 is Nehalem, where movdqu is a single-uop instruction with dedicated hardware support in the load ports. It's still useful to tell compilers when pointers are aligned (especially for auto-vectorization, and especially without AVX), but it's not a performance disaster for them to just use movdqu everywhere, especially when the data is in fact aligned at run-time.
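
    One way to tell the compiler about alignment, as a sketch (uses the GCC/clang-specific __builtin_assume_aligned; the function name is made up):

      #include <immintrin.h>

      // If the caller guarantees 32-byte alignment, say so: use the aligned
      // load intrinsic and/or __builtin_assume_aligned so the compiler can
      // emit vmovdqa and skip runtime alignment handling when auto-vectorizing.
      __m256i first_vector(const int *p) {
          const __m256i *vp = (const __m256i *)__builtin_assume_aligned(p, 32);
          return _mm256_load_si256(vp);    // vmovdqa: faults if p isn't 32B-aligned
      }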


    I don't know why Intel even made an AVX version of lddqu at all. I guess it's simpler for the decoders to just treat that opcode as an alias for movdqu / vmovdqu in all modes (with legacy SSE prefixes, or with AVX128 / AVX256), instead of having that opcode decode to something else with VEX prefixes.

    All current AVX-supporting CPUs have efficient hardware support for unaligned vector loads and stores, handling them as efficiently as possible. For example, when the data happens to be aligned at runtime, there's exactly zero performance difference vs. vmovdqa.

    This was not the case before Nehalem: movdqu and lddqu used to decode to multiple uops to handle potentially-misaligned addresses, instead of putting hardware support for that right in the load ports, where a single uop can handle an unaligned address instead of faulting on it.

    However, Intel's ISA reference manual entry for lddqu says the 256b version can load up to 64 bytes (implementation dependent), and goes on with this usage advice:

    This instruction may improve performance relative to (V)MOVDQU if the source operand crosses a cache line boundary. In situations that require the data loaded by (V)LDDQU be modified and stored to the same location, use (V)MOVDQU or (V)MOVDQA instead of (V)LDDQU. To move a double quadword to or from memory locations that are known to be aligned on 16-byte boundaries, use the (V)MOVDQA instruction.

    IDK how much of that was written deliberately, and how much of that just came from prepending (V) when updating the entry for AVX. I don't think Intel's optimization manual recommends really using vlddqu anywhere, but I didn't check.

    There is no AVX512 version of vlddqu, so I think that means Intel decided an alternate-strategy unaligned load instruction is no longer useful, and isn't even worth keeping their options open for.