Eigen::Vector3f alignment

I'm using Eigen to process an unstructured point set (point cloud), represented as an array of Eigen::Vector3f objects. In order to enable SIMD vectorization I've subclassed Vector3f into a class with alignas(16). The objects in the array each start at a 16 byte boundary, have gaps of 4 bytes between each other, and contain uninitialized data.

The subclass looks currently like this: (Stil need to add template copy constructor and operator= as indicated in Eigen documentation)

struct alignas(16) point_xyz : public Eigen::Vector3f {
    using Eigen::Vector3f::Vector3f;
};

point_xyz cloud[n];

Assembly output shows that SIMD instructions are being used and the a program which applies a transformation on each point_xyz in the array seems to work correctly.

Is it safe to use Eigen that way, or do the results depend on the contents of the unused 4 byte gaps, etc?

Also, would it be safe to put RGB color data or other into the unused 4 bytes (required overriding the memory alignment)?

Edit: It seems that both clang++ and g++ do some vectorization when optimization is enabled. Without optimization (and below -O2 for clang++), both generate a call to a Eigen library function for the following matrix multiplication (transformation):

using transform_t = Eigen::Transform<float, 3, Eigen::Affine>;
transform_t t = Eigen::AngleAxisf(0.01*M_PI, Eigen::Vector3f::UnitX()) * Eigen::Translation3f(0.1, 0.1, 0.1);
Eigen::Vector3f p(123, 234, 345);
std::cout << p << std::endl;

for(;;) {
  asm("# BEGIN TRANS");
  p = t * p;
  asm("# END TRANS");
}
std::cout << p << std::endl;

(The for loop and cout are needed so that the optimization doesn't remove the multiplication or put in a constant value).

In GCC (-O1) it results in

# 50 "src/main.cc" 1
    # BEGIN TRANS
# 0 "" 2
    movss   (%rsp), %xmm4
    movaps  %xmm4, %xmm2
    mulss   64(%rsp), %xmm2
    movss   4(%rsp), %xmm0
    movaps  %xmm0, %xmm1
    mulss   80(%rsp), %xmm1
    addss   %xmm1, %xmm2
    movss   8(%rsp), %xmm3
    movaps  %xmm4, %xmm5
    mulss   68(%rsp), %xmm5
    movaps  %xmm0, %xmm1
    mulss   84(%rsp), %xmm1
    addss   %xmm5, %xmm1
    movaps  %xmm3, %xmm5
    mulss   100(%rsp), %xmm5
    addss   %xmm5, %xmm1
    addss   116(%rsp), %xmm1
    mulss   72(%rsp), %xmm4
    mulss   88(%rsp), %xmm0
    addss   %xmm4, %xmm0
    movaps  %xmm3, %xmm4
    mulss   104(%rsp), %xmm4
    addss   %xmm4, %xmm0
    addss   120(%rsp), %xmm0
    mulss   96(%rsp), %xmm3
    addss   %xmm3, %xmm2
    addss   112(%rsp), %xmm2
    movss   %xmm2, (%rsp)
    movss   %xmm1, 4(%rsp)
    movss   %xmm0, 8(%rsp)
# 52 "src/main.cc" 1
    # END TRANS
# 0 "" 2

It results in the same output with and without #define EIGEN_DONT_VECTORIZE 1. With Vector4f, a slightly shorter output is generated when Eigen's vectorization is not disabled, but both operate on the xmm registers.

AlignedVector3<float> doesn't seem to support the multiplication with Eigen::Transform. I'm doing affine transformations on sets of points, represented using 3 (non-homogenuous) coordinates. I'm not sure how Eigen implements the transformation with Eigen::Transform<float, 3, Eigen::Affine> of a Eigen::Vector4f vector. I.e. does it only change the first 3 components of the vector, and does the fourth component have to be zero, or can it contain an arbitrary value, or does it interpret the 4-vector as homogenous coordinates? And does it depend on the internal representation of the transformation (Affine, AffineCompact, Projective).

Solution

Don't do this! Use Aligned Vector3 unsupported module instead. For example:

#include <unsupported/Eigen/AlignedVector3>
// ...
// and use it somewhere in the code:
Eigen::AlignedVector3<double> a, b;
// ...
a += b;// will use simd instruction, even known they aren't real 3d vectors.

The way that this class works is by storing internally a 4d vector(which is aligned by Eigen rules), when the latest coefficient isn't being used.

This class made it transparent to the user. Use it as the same way you used a normal 3d vector, simple as that!

The asm in the question is not using SIMD, just scalar FP operations in vector registers like normal for x86-64. (And for x86-32 when SSE is available). addss is an FP add Scalar Single-precision. (addps is add packed single).

Auto-vectorization is only enabled in gcc at -O3, not -O2, and this only used -O1.

The uninitialized element in each 16B chunk mostly prevents SIMD floating point, because they might contain denormals or NaNs that slow down math instructions by more than an order of magnitude (unless you enable Denormals Are Zero and Flush To Zero, and the compiler knows that and can take advantage of it). Integer instructions don't have this weakness of performance problems with garbage in some elements, but don't expect compilers to auto-vectorize even with that.