I'm using Eigen to process an unstructured point set (point cloud), represented as an array of Eigen::Vector3f
objects. In order to enable SIMD vectorization I've subclassed Vector3f
into a class with alignas(16). The objects in the array each start at a 16 byte boundary, have gaps of 4 bytes between each other, and contain uninitialized data.
The subclass looks currently like this: (Stil need to add template copy constructor and operator=
as indicated in Eigen documentation)
struct alignas(16) point_xyz : public Eigen::Vector3f {
using Eigen::Vector3f::Vector3f;
};
point_xyz cloud[n];
Assembly output shows that SIMD instructions are being used and the a program which applies a transformation on each point_xyz
in the array seems to work correctly.
Is it safe to use Eigen that way, or do the results depend on the contents of the unused 4 byte gaps, etc?
Also, would it be safe to put RGB color data or other into the unused 4 bytes (required overriding the memory alignment)?
Edit: It seems that both clang++ and g++ do some vectorization when optimization is enabled. Without optimization (and below -O2 for clang++), both generate a call to a Eigen library function for the following matrix multiplication (transformation):
using transform_t = Eigen::Transform<float, 3, Eigen::Affine>;
transform_t t = Eigen::AngleAxisf(0.01*M_PI, Eigen::Vector3f::UnitX()) * Eigen::Translation3f(0.1, 0.1, 0.1);
Eigen::Vector3f p(123, 234, 345);
std::cout << p << std::endl;
for(;;) {
asm("# BEGIN TRANS");
p = t * p;
asm("# END TRANS");
}
std::cout << p << std::endl;
(The for loop and cout are needed so that the optimization doesn't remove the multiplication or put in a constant value).
In GCC (-O1) it results in
# 50 "src/main.cc" 1
# BEGIN TRANS
# 0 "" 2
movss (%rsp), %xmm4
movaps %xmm4, %xmm2
mulss 64(%rsp), %xmm2
movss 4(%rsp), %xmm0
movaps %xmm0, %xmm1
mulss 80(%rsp), %xmm1
addss %xmm1, %xmm2
movss 8(%rsp), %xmm3
movaps %xmm4, %xmm5
mulss 68(%rsp), %xmm5
movaps %xmm0, %xmm1
mulss 84(%rsp), %xmm1
addss %xmm5, %xmm1
movaps %xmm3, %xmm5
mulss 100(%rsp), %xmm5
addss %xmm5, %xmm1
addss 116(%rsp), %xmm1
mulss 72(%rsp), %xmm4
mulss 88(%rsp), %xmm0
addss %xmm4, %xmm0
movaps %xmm3, %xmm4
mulss 104(%rsp), %xmm4
addss %xmm4, %xmm0
addss 120(%rsp), %xmm0
mulss 96(%rsp), %xmm3
addss %xmm3, %xmm2
addss 112(%rsp), %xmm2
movss %xmm2, (%rsp)
movss %xmm1, 4(%rsp)
movss %xmm0, 8(%rsp)
# 52 "src/main.cc" 1
# END TRANS
# 0 "" 2
It results in the same output with and without #define EIGEN_DONT_VECTORIZE 1
. With Vector4f
, a slightly shorter output is generated when Eigen's vectorization is not disabled, but both operate on the xmm registers.
AlignedVector3<float>
doesn't seem to support the multiplication with Eigen::Transform
. I'm doing affine transformations on sets of points, represented using 3 (non-homogenuous) coordinates. I'm not sure how Eigen implements the transformation with Eigen::Transform<float, 3, Eigen::Affine>
of a Eigen::Vector4f
vector. I.e. does it only change the first 3 components of the vector, and does the fourth component have to be zero, or can it contain an arbitrary value, or does it interpret the 4-vector as homogenous coordinates? And does it depend on the internal representation of the transformation (Affine
, AffineCompact
, Projective
).
Don't do this! Use Aligned Vector3 unsupported module instead. For example:
#include <unsupported/Eigen/AlignedVector3>
// ...
// and use it somewhere in the code:
Eigen::AlignedVector3<double> a, b;
// ...
a += b;// will use simd instruction, even known they aren't real 3d vectors.
The way that this class works is by storing internally a 4d vector(which is aligned by Eigen rules), when the latest coefficient isn't being used.
This class made it transparent to the user. Use it as the same way you used a normal 3d vector, simple as that!
The asm in the question is not using SIMD, just scalar FP operations in vector registers like normal for x86-64. (And for x86-32 when SSE is available). addss
is an FP add Scalar Single-precision. (addps
is add packed single).
Auto-vectorization is only enabled in gcc at -O3
, not -O2
, and this only used -O1
.
The uninitialized element in each 16B chunk mostly prevents SIMD floating point, because they might contain denormals or NaNs that slow down math instructions by more than an order of magnitude (unless you enable Denormals Are Zero and Flush To Zero, and the compiler knows that and can take advantage of it). Integer instructions don't have this weakness of performance problems with garbage in some elements, but don't expect compilers to auto-vectorize even with that.