I want to have X number of std::vectors of equal size, which can be processed together in a for loop that runs linearly from start to finish. For example:
    for (int i = 0; i < vector_length; i++)
        vector1[i] = vector2[i] + vector3[i] * vector4[i];
I want all of this to take full advantage of SIMD instructions. For this to happen, the compiler should be able to assume that each of the vectors is aligned optimally for __m256 use. If the compiler can't assume this, it may generate suboptimal loops, for example with unaligned loads or extra code to handle misaligned leading elements.
How do I ensure this optimal alignment of std::vectors and optimal code generation for such aligned data?
It can be assumed that each vector has identical data structures inside, which can be added/multiplied together using standard SIMD instructions.
I'm using C++17.
MORE INFORMATION AS REQUESTED BY THE PEOPLE HERE:
32 bytes of alignment is good for my use.
I want to get this running on Intel Macs and PCs (Xcode + Visual Studio), and later on ARM-based Macs when I get one of those computers (Xcode again).
As a couple of people pointed out, there's a related question which can be used to first ensure that the memory owned by the std::vector is properly aligned:
Modern approach to making std::vector allocate aligned memory
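For anyone landing here, this is roughly what that approach can look like in C++17. It is only a minimal sketch, not the code from the linked question: the names AlignedAllocator and AlignedVector are made up for illustration, and it assumes the toolchain and deployment target support the C++17 aligned operator new.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical allocator (name invented for this sketch): it forwards to the
// C++17 aligned operator new/delete so the buffer starts on an
// Alignment-byte boundary.
template <typename T, std::size_t Alignment>
struct AlignedAllocator {
    static_assert((Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two");

    using value_type = T;

    // Needed explicitly because Alignment is a non-type template parameter,
    // so std::allocator_traits cannot deduce rebind on its own.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;

    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // Guard against overflow in n * sizeof(T).
        if (n > static_cast<std::size_t>(-1) / sizeof(T))
            throw std::bad_array_new_length();
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

// All instances are interchangeable: the allocator holds no state.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return false; }

// Convenience alias for the 32-byte case mentioned in this question.
template <typename T>
using AlignedVector = std::vector<T, AlignedAllocator<T, 32>>;
```

With this, AlignedVector<float> v(1024); should give a data() pointer on a 32-byte boundary. Note that this only aligns the storage; the compiler still doesn't know about the alignment inside a function that merely receives a raw pointer.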
That, combined with the GCC/Clang-style __attribute__((aligned(ALIGNMENT_IN_BYTES))) added to the function parameters (pointers), seems to do the trick. Example:
    void Process(__attribute__((aligned(ALIGNMENT_IN_BYTES))) const uint8_t* p_source1,
                 __attribute__((aligned(ALIGNMENT_IN_BYTES))) const uint8_t* p_source2,
                 __attribute__((aligned(ALIGNMENT_IN_BYTES))) uint8_t* p_destination,
                 const int count)
    {
        for (int i = 0; i < count; i++)
            p_destination[i] = p_source1[i] + p_source2[i];
    }
That seems to compile nicely (checked in Godbolt), so the compiler appears to be able to use wide registers to process the data with SIMD instructions.
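One caveat, as far as I understand it: the aligned attribute on a declaration is documented as setting the alignment of the declared object itself, so on a pointer parameter it may describe the pointer variable rather than the bytes it points to, and Visual Studio doesn't accept __attribute__ at all. A sketch of an alternative is below. It uses __builtin_assume_aligned, which GCC and Clang provide for promising pointee alignment. The wrapper name assume_aligned_ptr and the function name ProcessAligned are made up for this example, and on other compilers the wrapper simply returns the pointer unchanged (no hint, but still correct).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper (name invented for this sketch). On GCC/Clang it
// promises the compiler that p points to Alignment-byte-aligned memory.
// On other compilers it is a no-op.
template <std::size_t Alignment, typename T>
inline T* assume_aligned_ptr(T* p) noexcept {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<T*>(__builtin_assume_aligned(p, Alignment));
#else
    return p;
#endif
}

// Same loop as above, but the alignment promise is attached to the pointees.
// Passing a pointer that is NOT 32-byte aligned is undefined behaviour on
// compilers where the hint is active.
void ProcessAligned(const std::uint8_t* p_source1,
                    const std::uint8_t* p_source2,
                    std::uint8_t* p_destination,
                    const int count)
{
    const std::uint8_t* a = assume_aligned_ptr<32>(p_source1);
    const std::uint8_t* b = assume_aligned_ptr<32>(p_source2);
    std::uint8_t* d = assume_aligned_ptr<32>(p_destination);

    for (int i = 0; i < count; i++)
        d[i] = static_cast<std::uint8_t>(a[i] + b[i]);  // wraps modulo 256
}
```

In C++20 there is std::assume_aligned in <memory> for the same purpose, but since I'm on C++17 the builtin is the closest equivalent I know of. It's worth comparing both variants in Godbolt to see whether the hint actually changes the generated loads.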
Thank you everyone!