c++memory stdvector memory-alignment avx2

Usage of alignas in template argument of std::vector

I want to create a std::vector with doubles. But these doubles should be aligned in 32 byte for AVX2 registers. What would be the best way to do this? Can I simply write something like
std::vector<alignas(32)double>?

Thanks in advance for the help.

Solution

If alignas(32)double compiled, it would require that each element separately had 32-byte alignment, i.e. pad each double out to 32 bytes, completely defeating SIMD. (I don't think it will compile, but similar things with GNU C typedef double da __attribute__((aligned(32))) do compile that way, with sizeof(da) == 32.)

See Modern approach to making std::vector allocate aligned memory for working code.

As of C++17, std::vector<__m256d> would work, but is usually not what you want because it makes scalar access a pain.

C++ sucks for this in my experience, although there might be a standard (or Boost) allocator that takes an over-alignment you can use as the second (usually defaulted) template param.

std::vector<double, some_aligned_allocator<32> > still isn't type-compatible with normal std::vector, which makes sense because any function that might reallocated it has to maintain alignment. But unfortunately that makes it not type-compatible even for passing to functions that only want read-only access to a std::vector of double elements.

Cost of misalignment

For a lot of cases the misalignment is only a couple percent worse than aligned, for AVX/AVX2 loops over an array if data's coming from L3 cache or RAM (on recent Intel CPUs); only with 64-byte vectors do you get a significantly bigger penalty (like 15% or so even when memory bandwidth is still the bottleneck.) You'd hope that the CPU core would have time to deal with it and keep the same number of outstanding off-core transactions in flight. But it doesn't.

For data hot in L1d, misalignment could hurt more even with 32-byte vectors.

In x86-64 code, alignof(max_align_t) is 16 on mainstream C++ implementations, so in practice even a vector<double> will end up aligned by 16 at least because the underlying allocator used by new always aligns at least that much. But that's very often an odd multiple of 16, at least on GNU/Linux. Glibc's allocator (also used by malloc) for large allocations uses mmap to get a whole range of pages, but it reserves the first 16 bytes for bookkeeping info. This is unfortunate for AVX and AVX-512 because it means your arrays are always misaligned unless you used aligned allocations. (How to solve the 32-byte-alignment issue for AVX load/store operations?)

Mainstream std::vector implementations are also inefficient when they have to grow: C++ doesn't provide a realloc equivalent that's compatible with new/delete, so it always has to allocate more space and copy to the start. Never even trying to allocate more space contiguous with the existing mapping (which would be safe even for non-trivially-copyable types), and not using implementation-specific tricks like Linux mremap to map the same physical pages to a different virtual address without having to copy all those mega/gigabytes. The fact that C++ allows code to redefine operator new means library implementations of std::vector can't just use a better allocator, either. All of this is a non-problem if you .reserve the size you're going to need, but it is pretty dumb.