What is the best way to use Eigen lazy evalution to multiply each row of an Nx3 matrix by a different 3x3 rotation matrix?

Pretty much the title. I'm trying to think up a good way to formulate this without creating any temporaries so I can actually make use of Eigen's speediness. The best I could come up with is this formulation:

Essentially, I end up with a 3nx3n matrix being multiplied by a 3nxn matrix, and I end up with a 3nxn matrix of the transformed coordinates. Is this the most efficient approach? I don't love that there is the extra overhead (both performance and memory-wise) of constructing these full matrices.

Is there a way to maybe use an Eigen::Map on a contiguous set of data, e.g. [a0, b0, c0, d0, e0, f0, g0, h0, i0, a1, b1, c1, d1,...], and have it behave like the sparse matrix below?

Additionally, if my above approach is the most sensible way, is there an efficient way to collapse the resulting matrix into a non-sparse 3xN matrix of the results?

Solution

This will not work the way you hope. The main issue is that for a dense matrix format, you would vastly increase the number of operations -- most of them being zero -- while for sparse matrix products, Eigen doesn't have a block-diagonal sparse format (though we can build one if we want.

Instead, Eigen can optimize well for small fixed-size matrices or blocks of larger matrices. It is much better to use something like Matrix3f or Matrix3Xf than a MatrixXf that happens to be size 3 in one or both dimensions at runtime.

Therefore a good choice is to do something like this:

    int rotation_count = ...;
    Eigen::Matrix3Xf rotations = Eigen::Matrix3Xf::Random(3, rotation_count * 3);
    Eigen::Matrix3Xf positions = Eigen::Matrix3Xf::Random(3, rotation_count);
    Eigen::Matrix3Xf out(3, rotation_count);
#   pragma omp parallel for
    for(Eigen::Index i = 0; i < rotation_count; ++i)
        out.col(i).noalias() =
              rotations.middleCols<3>(i * 3) * positions.col(i);

Note two points in particular:

The number of rows is fixed instead of the number of columns. With Eigen's column-major matrix format, this means middleCols refers to a consecutive 3x3 segment instead of having a large stride from one column to the next.
We use the template overload of middleCols to create a block that carries the compile-time information that it is 3x3

We can further improve this by padding to 4x4 matrices. This wastes a bit of memory but it allows better vectorization since 4 floats fill one SSE register or 4 doubles fill one AVX register.

    int rotation_count = ...;
    // With real data, keep the 4-th column and every 4-th row zero
    Eigen::Matrix4Xf rotations = Eigen::Matrix4Xf::Random(4, rotation_count * 4);
    Eigen::Matrix4Xf positions = Eigen::Matrix4Xf::Random(4, rotation_count);
    Eigen::Matrix4Xf out(4, rotation_count);
#   pragma omp parallel for
    for(Eigen::Index i = 0; i < rotation_count; ++i)
        out.col(i).noalias() =
              rotations.middleCols<4>(i * 4) * positions.col(i);

On my system, the 3x3 implementation does 100,000 repetitions with 10,000 rotation matrices in 3.7 seconds and the 4x4 implementation takes 2,6 seconds. Make sure to compile with -DNDEBUG for production to remove the cost of assertions.

Alternatively you could try and get Eigen's unsupported tensor module to work for this but I'm not familiar with its use.