Tags: c++, eigen3

Better way to implement A^T*A when A is a lower triangular matrix


In the Eigen library, to compute A^T*A one can write:

X.template triangularView<Lower>().setZero(); 
X.template selfadjointView<Lower>().rankUpdate(A.transpose());

Is there any better (more efficient) way to write this when A is a lower triangular matrix? I tried the following, but it gives a compilation error:

X.template selfadjointView<Lower>().rankUpdate(A.template triangularView<Lower>().transpose());

The compiler reports:

 error: no matching member function for call to 'rankUpdate'

Solution

  • Eigen typically includes only those matrix routines that have a corresponding function in BLAS. There is no BLAS routine for this particular case. One additional wrinkle is that Eigen's implementations of rankUpdate and triangular matrix multiplication are not parallelized unless you define -DEIGEN_USE_BLAS and link against a BLAS library that does parallelize them, e.g. OpenBLAS.
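    For example, a build that routes the supported kernels through OpenBLAS might look like this (an illustrative compile line; the compiler flags and the OpenBLAS package name depend on your system):

    ```shell
    # Tell Eigen to forward supported kernels to an external BLAS,
    # then link against OpenBLAS (multithreaded by default).
    g++ -O3 -march=native -DEIGEN_USE_BLAS main.cpp -lopenblas
    ```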

    I did some testing with various implementations.

    rankUpdate

      void trtr(Eigen::MatrixXf& out, const Eigen::MatrixXf& in)
      {
        out = Eigen::MatrixXf::Zero(in.rows(), in.cols());
        out.selfadjointView<Eigen::Lower>().rankUpdate(in.transpose());
        out.triangularView<Eigen::StrictlyUpper>() = out.transpose();
      }
    

    Only parallelized with OpenBLAS; then it is pretty fast, otherwise somewhat slow.

    Triangular matrix product

      void trtr(Eigen::MatrixXf& out, const Eigen::MatrixXf& in)
      { out.noalias() = in.transpose() * in.triangularView<Eigen::Lower>(); }
    

    Only parallelized with OpenBLAS. Even then it is somewhat slow.

    General matrix product

      void trtr(Eigen::MatrixXf& out, const Eigen::MatrixXf& in)
      { out.noalias() = in.transpose() * in; }
    

    Always parallelized. Consistently fast.

    Custom blockwise multiplication

    I tried implementing my own version that splits the matrix into blocks to remove redundant operations. This is somewhat faster for large matrices, especially when using OpenBLAS.

    It is not well tuned for small matrices.
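    The underlying algebra, sketched here for an even size n = 2m (odd sizes use a slightly overlapping split, full = ⌈n/2⌉ and part = ⌊n/2⌋), is this block identity for lower-triangular A, with L1, L2 lower triangular:

    ```latex
    A = \begin{pmatrix} L_1 & 0 \\ B & L_2 \end{pmatrix},
    \qquad
    A^T A = \begin{pmatrix}
    L_1^T L_1 + B^T B & B^T L_2 \\
    L_2^T B           & L_2^T L_2
    \end{pmatrix}
    ```

    The two diagonal blocks are again A^T*A products of lower-triangular matrices, so they recurse; the off-diagonal block is the one triangular product computed per level (only its lower-half transpose, L_2^T B, is formed, since the output is symmetric).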

      void trtr_recursive
      (Eigen::Ref<Eigen::MatrixXf> out,
       const Eigen::Ref<const Eigen::MatrixXf>& in) noexcept
      {
        Eigen::Index size = out.rows();
        if(size <= 16) {
          /*
           * Two notes:
           * 1. The cutoff point is a possible tuning parameter
           * 2. This is the only place where we assume that the upper triangle
           *    is initialized to zero. We can remove this assumption by making
           *    a copy of the input into a fixed size matrix. Use Eigen::Matrix
           *    with the MaxRows and MaxCols template argument for this
           */
          out.selfadjointView<Eigen::Lower>().rankUpdate(in.transpose());
          return;
        }
        Eigen::Index full = (size + 1) >> 1;
        Eigen::Index part = size >> 1;
        const auto& fullin = in.bottomLeftCorner(full, full);
        const auto& bottomright = in.bottomRightCorner(part, part);
        const auto& bottomleft = in.bottomLeftCorner(part, full);
        out.topLeftCorner(full, full).selfadjointView<Eigen::Lower>()
          .rankUpdate(fullin.transpose());
        out.bottomLeftCorner(part, full).noalias() +=
          bottomright.transpose().triangularView<Eigen::Upper>() *
          bottomleft;
        trtr_recursive(out.topLeftCorner(part, part),
                       in.topLeftCorner(part, part));
        trtr_recursive(out.bottomRightCorner(part, part),
                       bottomright);
      }
      void trtr(Eigen::MatrixXf& out, const Eigen::MatrixXf& in)
      {
        out = Eigen::MatrixXf::Zero(in.rows(), in.cols());
        trtr_recursive(out, in);
        out.triangularView<Eigen::StrictlyUpper>() = out.transpose();
      }