c++ optimization c++17 stdvector simd

A way to ensure std::vector is always aligned for optimal SIMD execution?


I want to have some number of std::vectors of equal size that can be processed together in a for loop which runs linearly from start to finish. For example:

for (int i = 0; i < vector_length; i++)
    vector1[i] = vector2[i] + vector3[i] * vector4[i];

I want all this to take full advantage of SIMD instructions. For this to happen, the compiler should be able to assume that each of the vectors is aligned optimally for __m256 use. If the compiler can't assume this, it may generate all sorts of suboptimal loops.

How do I ensure this optimal alignment of std::vectors and optimal code generation for such aligned data?

It can be assumed that each vector has identical data structures inside, which can be added/multiplied together using standard SIMD instructions.

I'm using C++17.

MORE INFORMATION AS REQUESTED BY THE PEOPLE HERE:

32 bytes of alignment is good for my use.

I want to get this running on Intel Macs and PCs (Xcode + Visual Studio), and later on ARM Macs once I get one of those computers (Xcode again).


Solution

  • As a couple of people pointed out, there's a related question which can be used to first ensure that the memory owned by the std::vector is properly aligned:

    Modern approach to making std::vector allocate aligned memory

    That, combined with __attribute__((aligned(ALIGNMENT_IN_BYTES))) added to the pointer parameters of the function, seems to do the trick. Example:

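    #include <cstdint>

    #define ALIGNMENT_IN_BYTES 32  // 32 bytes, as stated in the question

    // Adds two byte buffers element by element (results wrap modulo 256).
    void Process(__attribute__((aligned(ALIGNMENT_IN_BYTES))) const uint8_t* p_source1,
                 __attribute__((aligned(ALIGNMENT_IN_BYTES))) const uint8_t* p_source2,
                 __attribute__((aligned(ALIGNMENT_IN_BYTES))) uint8_t*       p_destination,
                 const int      count)
    {
        for (int i = 0; i < count; i++)
            p_destination[i] = static_cast<uint8_t>(p_source1[i] + p_source2[i]);
    }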
    

    That seems to compile nicely (checked on Godbolt), so the compiler appears to assume it can use wide registers to process the data with SIMD instructions.
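One caveat worth verifying on your own compilers: as far as I understand, GCC and Clang apply `aligned` in that position to the pointer variable itself rather than to the data it points to, and Visual Studio does not accept `__attribute__` at all. A more explicit way to state the promise on GCC and Clang is the `__builtin_assume_aligned` builtin (C++20 adds `std::assume_aligned`). The following is only a minimal sketch: `AssumeAligned32` and `ProcessAligned` are names made up for illustration, and on other compilers the helper simply returns the pointer without any alignment promise.

```cpp
#include <cstdint>

// Hypothetical helper: tells GCC/Clang that p points to 32-byte aligned data.
// Passing a pointer that is not actually aligned is undefined behaviour.
template <typename T>
inline T* AssumeAligned32(T* p)
{
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<T*>(__builtin_assume_aligned(p, 32));
#else
    return p;  // assumption: no equivalent hint is used on other compilers
#endif
}

// Same loop as above, with the alignment promise attached to the pointees.
void ProcessAligned(const std::uint8_t* p_source1,
                    const std::uint8_t* p_source2,
                    std::uint8_t*       p_destination,
                    const int           count)
{
    const std::uint8_t* a   = AssumeAligned32(p_source1);
    const std::uint8_t* b   = AssumeAligned32(p_source2);
    std::uint8_t*       out = AssumeAligned32(p_destination);

    for (int i = 0; i < count; i++)
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
}
```

The buffers passed in still have to be 32-byte aligned in reality, for example `alignas(32)` arrays or a vector with an aligned allocator. Whether the generated code actually changes is best checked on Godbolt for each target.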

    Thank you everyone!
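For the aligned-memory half, here is a minimal C++17 sketch of an allocator built on the aligned forms of `operator new` and `operator delete`. It is not the code from the linked question. `AlignedAllocator`, `AlignedVector` and `IsAligned` are hypothetical names, and the default of 32 bytes is taken from the requirement stated in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator; Alignment is assumed to be a power of two.
template <typename T, std::size_t Alignment = 32>
struct AlignedAllocator {
    using value_type = T;
    static_assert(Alignment >= alignof(T), "Alignment is weaker than alignof(T)");

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Explicit rebind is needed because Alignment is a non-type parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n)
    {
        if (n > static_cast<std::size_t>(-1) / sizeof(T))
            throw std::bad_array_new_length();
        // C++17 aligned operator new; throws std::bad_alloc on failure.
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }

    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

// The allocator is stateless, so all instances compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Convenience alias: a std::vector whose buffer starts on a 32-byte boundary.
template <typename T>
using AlignedVector = std::vector<T, AlignedAllocator<T>>;

// Runtime check, useful in asserts before calling a SIMD-friendly function.
inline bool IsAligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Usage would look like `AlignedVector<float> v(1024);` followed by `assert(IsAligned(v.data(), 32));`. Only the start of the buffer is aligned this way, so a loop that begins at an arbitrary offset into the vector does not get the same guarantee.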