I have some matrix defined as:
Eigen::MatrixXd DPCint = Eigen::MatrixXd::Zero(p.szZ*(p.na-1),p.szX);
// perform some computations and fill every sub-matrix of size [p.szZ, p.szX] with some values
#pragma omp parallel for
for (int i=0; i < p.na-1; i++)
{
...
DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ....;
}
// Now sum every p.szZ rows to get a matrix that is [p.szZ, p.szX]
In Matlab, this operation is fast and trivial. I can't simply do a += operation here if I want to parallelize the loop with OpenMP. Similarly, I can loop through and sum every set of p.szZ rows, but that loop cannot be parallelized since each thread would be outputting to the same data. Is there some efficient way to use Eigen's indexing operations to sum sub-matrices? This seems like a simple operation and I feel like I'm missing something, but I haven't been able to find a solution for some time.
Clarification
Essentially, after the above loop, I'd like to do this in a single line:
for (int i = 0; i < p.na-1; i++)
{
DPC += DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all);
}
In Matlab, I can simply reshape the matrix into a 3D array and sum along the third dimension. I'm not familiar with Eigen's tensor library, and I hope this operation is doable without resorting to it. However, my priority is speed and efficiency, so I'm open to any suggestions.
Here is my take.
#pragma omp parallel
{
/*
* We force static schedule to prevent excessive cache-line bouncing
* because the elements per thread are not consecutive.
* However, most (all?) OpenMP implementations use static scheduling
* by default anyway.
* Switching to threads initializing full columns would be
* more effective from a memory POV.
*/
# pragma omp for schedule(static)
for(int i=0; i < p.na-1; i++) {
/*
 * Note: Unlike typical C++ half-open ranges, Eigen::seq(first, last)
 * is inclusive of last. So this touches exactly the p.szZ rows
 * [i*p.szZ, (i+1)*p.szZ - 1], as in the original code.
 */
DPCint(Eigen::seq(i*p.szZ,(i+1)*p.szZ-1),Eigen::all) = ...;
/*
* Same as
* DPCint.middleRows(i*p.szZ, p.szZ) = ...
*/
}
/*
* We rely on the implicit barrier at the end of the for-construct
* for synchronization. Then start a new loop in the same parallel
* construct. This one can be nowait as it is the last one.
* Again, static scheduling limits cache-line bouncing to the first
* and last column/cache line per thread.
* But since we wrote rows per thread above and now read
* columns per thread, there are still a lot of cache misses
*/
# pragma omp for schedule(static) nowait
for(int i=0; i < p.szX; i++) {
/*
* Now we let a single thread reduce a column.
* Not a row because we deal with column-major matrices
* so this pattern is more cache-efficient
*/
DPC.col(i) += DPCint.col(i).reshaped(
p.szZ, p.na - 1).rowwise().sum();
}
}
Reshaping is new in Eigen-3.4. However, I noticed that the resulting assembly isn't particularly effective (no vectorization).
Rowwise reductions have always been somewhat slow in Eigen. So we might do better like this, which also works in Eigen-3.3:
# pragma omp for schedule(static) nowait
for(int i = 0; i < p.szX; i++) {
const auto& incol = DPCint.col(i);
auto outcol = DPC.col(i);
for(int j = 0; j < p.na - 1; j++)
outcol += incol.segment(j * p.szZ, p.szZ);
}
Alternatively, multiplying the reshaped matrix with an all-ones vector also works surprisingly well. It needs benchmarking but, especially with Eigen using OpenBLAS, it could be faster than rowwise summation.
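For illustration, a minimal sketch of that variant, reusing the second loop of the parallel region above (Eigen-3.4's reshaped is assumed, as before; the ones vector is my addition, sized to the p.na - 1 blocks per column):
const Eigen::VectorXd ones = Eigen::VectorXd::Ones(p.na - 1);
/*
 * Declared per thread here for brevity; hoisting it outside the
 * parallel region avoids the redundant per-thread allocation.
 */
# pragma omp for schedule(static) nowait
for(int i = 0; i < p.szX; i++) {
/*
 * Summing the p.na - 1 chunks of length p.szZ in this column is the
 * same as multiplying the reshaped [szZ x (na-1)] view by a vector
 * of ones, which maps to a (potentially BLAS-backed) matrix-vector product.
 */
DPC.col(i).noalias() +=
DPCint.col(i).reshaped(p.szZ, p.na - 1) * ones;
}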
Okay, I went ahead and did some tests. First, let's set up a minimal reproducible example, because we didn't have one before.
void reference(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
Eigen::MatrixXd DPCint(szZ * na, szX);
# pragma omp parallel for
for(Eigen::Index a = 0; a < na; ++a)
for(Eigen::Index x = 0; x < szX; ++x)
for(Eigen::Index z = 0; z < szZ; ++z)
DPCint(a * szZ + z, x) =
a * 0.25 + x * 1.34 + z * 12.68;
for(Eigen::Index a = 0; a < na; ++a)
DPC += DPCint.middleRows(a * szZ, szZ);
}
void test(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{...}
int main()
{
const int szZ = 500, szX = 192, na = 15;
const int repetitions = 10000;
Eigen::MatrixXd ref = Eigen::MatrixXd::Zero(szZ, szX);
Eigen::MatrixXd opt = Eigen::MatrixXd::Zero(szZ, szX);
reference(ref, na);
test(opt, na);
std::cout << (ref - opt).cwiseAbs().sum() << std::endl;
for(int i = 0; i < repetitions; ++i)
test(opt, na);
}
The array dimensions are as described by the OP. The DPCint initialization was chosen as a simple scalar formula so that any optimized implementation can be checked for correctness against the reference. The number of repetitions was picked for a reasonable runtime.
Compiled and tested with g++-10 -O3 -march=native -DNDEBUG -fopenmp
on an AMD Ryzen Threadripper 2990WX (32 core, 64 thread). NUMA enabled. Using Eigen-3.4.0.
The reference gives 16.6 seconds.
Let's optimize the initialization to get this out of the way:
void reference_op1(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
Eigen::MatrixXd DPCint(szZ * na, szX);
const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const Eigen::VectorXd zvals =
Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
# pragma omp parallel for collapse(2)
for(Eigen::Index a = 0; a < na; ++a)
for(Eigen::Index x = 0; x < szX; ++x)
DPCint.col(x).segment(a * szZ, szZ) = zvals.array() + xvals[x] + avals[a];
for(Eigen::Index a = 0; a < na; ++a)
DPC += DPCint.middleRows(a * szZ, szZ);
}
The LinSpaced isn't really helping, but notice the collapse(2). Since na is only 15 on a 64-thread machine, we need to parallelize over both loops. This runs in 15.4 seconds.
Let's test my proposed version:
void rowwise(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
Eigen::MatrixXd DPCint(szZ * na, szX);
const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const Eigen::VectorXd zvals =
Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
# pragma omp parallel
{
# pragma omp for collapse(2)
for(Eigen::Index a = 0; a < na; ++a)
for(Eigen::Index x = 0; x < szX; ++x)
DPCint.col(x).segment(a * szZ, szZ) =
zvals.array() + xvals[x] + avals[a];
# pragma omp for nowait
for(Eigen::Index x = 0; x < szX; ++x)
DPC.col(x) += DPCint.col(x).reshaped(szZ, na).rowwise().sum();
}
}
Runs at 12.5 seconds. Not a lot of speedup given that we just parallelized the second half of our algorithm.
As I suggested earlier, rowwise reductions are crap and can be avoided with matrix-vector products. Let's see if this helps here:
void rowwise_dot(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
Eigen::MatrixXd DPCint(szZ * na, szX);
const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const Eigen::VectorXd zvals =
Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
const Eigen::VectorXd ones = Eigen::VectorXd::Ones(szZ);
# pragma omp parallel
{
# pragma omp for collapse(2)
for(Eigen::Index a = 0; a < na; ++a)
for(Eigen::Index x = 0; x < szX; ++x)
DPCint.col(x).segment(a * szZ, szZ) =
zvals.array() + xvals[x] + avals[a];
# pragma omp for nowait
for(Eigen::Index x = 0; x < szX; ++x)
DPC.col(x).noalias() +=
DPCint.col(x).reshaped(szZ, na) * ones;
}
}
Nope, still 12.5 seconds. What happens when we compile with -DEIGEN_USE_BLAS -lopenblas_openmp? Same number. It might still be worth it if you cannot compile for AVX2 but the CPU supports it, since Eigen has no support for runtime CPU feature detection. Or it might help with float more than with double, because the benefit of vectorization is higher there.
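For the record, the full command line for that experiment looks roughly like this (benchmark.cpp is a placeholder for the actual source file; the flags are the ones stated above):
g++-10 -O3 -march=native -DNDEBUG -fopenmp -DEIGEN_USE_BLAS benchmark.cpp -lopenblas_openmp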
What if we build our own rowwise reduction in a way that vectorizes?
void rowwise_loop(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
Eigen::MatrixXd DPCint(szZ * na, szX);
const auto avals = Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const Eigen::VectorXd zvals =
Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
# pragma omp parallel
{
# pragma omp for collapse(2)
for(Eigen::Index a = 0; a < na; ++a)
for(Eigen::Index x = 0; x < szX; ++x)
DPCint.col(x).segment(a * szZ, szZ) =
zvals.array() + xvals[x] + avals[a];
# pragma omp for nowait
for(Eigen::Index x = 0; x < szX; ++x)
for(Eigen::Index a = 0; a < na; ++a)
DPC.col(x) += DPCint.col(x).segment(a * szZ, szZ);
}
}
13.3 seconds. Note that on my laptop (Intel i7-8850H), this was significantly faster than the rowwise version. NUMA and cache-line bouncing may be a serious issue on the larger Threadripper, but I didn't investigate the perf counters.
At this point I think it becomes apparent that the layout of DPCint and the loop ordering in its setup are a liability. Maybe there is a reason for it. But if there isn't, I propose changing it as follows:
void reordered(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
Eigen::MatrixXd DPCint(szZ * na, szX);
const Eigen::VectorXd avals =
Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
# pragma omp parallel
{
# pragma omp for
for(Eigen::Index x = 0; x < szX; ++x)
for(Eigen::Index z = 0; z < szZ; ++z)
DPCint.col(x).segment(z * na, na) =
avals.array() + xvals[x] + zvals[z];
# pragma omp for nowait
for(Eigen::Index x = 0; x < szX; ++x)
DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum();
}
}
The idea is to reshape it in such a way that a) colwise sums are possible and b) the same thread touches the same elements in the first and second loop.
Interestingly, this seems slower at 15.3 seconds. I guess the innermost assignment is now too short.
What happens if we fold both parts of the algorithm into one loop, reducing the synchronization overhead and improving caching?
void reordered_folded(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
Eigen::MatrixXd DPCint(szZ * na, szX);
const Eigen::VectorXd avals =
Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
# pragma omp parallel for
for(Eigen::Index x = 0; x < szX; ++x) {
for(Eigen::Index z = 0; z < szZ; ++z)
DPCint.col(x).segment(z * na, na) =
avals.array() + xvals[x] + zvals[z];
DPC.col(x) += DPCint.col(x).reshaped(na, szZ).colwise().sum();
}
}
12.3 seconds. At this point, why do we even have a shared DPCint array? Let's use a per-thread matrix.
void reordered_loctmp(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
const Eigen::VectorXd avals =
Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
# pragma omp parallel
{
Eigen::MatrixXd DPCint(na, szZ);
# pragma omp for nowait
for(Eigen::Index x = 0; x < szX; ++x) {
for(Eigen::Index z = 0; z < szZ; ++z)
DPCint.col(z) = avals.array() + xvals[x] + zvals[z];
DPC.col(x) += DPCint.colwise().sum();
}
}
}
Eureka! 6.8 seconds. We eliminated the cache-line bouncing. We made everything cache-friendly and properly vectorized.
The only thing I can think of now is turning DPCint into an expression that is evaluated on the fly, but this very much depends on the actual expression, so I can only sketch the idea below.
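As a rough sketch for the synthetic initialization used in this benchmark (the function name on_the_fly is mine; whether the real computation can be folded like this depends entirely on what the omitted expression actually is):
void on_the_fly(Eigen::Ref<Eigen::MatrixXd> DPC,
int na)
{
const Eigen::Index szZ = DPC.rows();
const Eigen::Index szX = DPC.cols();
const Eigen::VectorXd avals =
Eigen::VectorXd::LinSpaced(na, 0., (na - 1) * 0.25);
const auto xvals = Eigen::VectorXd::LinSpaced(szX, 0., (szX - 1) * 1.34);
const auto zvals = Eigen::VectorXd::LinSpaced(szZ, 0., (szZ - 1) * 12.68);
# pragma omp parallel for
for(Eigen::Index x = 0; x < szX; ++x)
for(Eigen::Index z = 0; z < szZ; ++z)
/*
 * Reduce the na values belonging to this (z, x) entry directly,
 * without ever materializing them in a temporary matrix.
 */
DPC(z, x) += (avals.array() + xvals[x] + zvals[z]).sum();
}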