c++ · matlab · eigen · eigen3

Using a list of linear indices to map between an input and output matrix in Eigen


Say I have 3 2D matrices:

  • shiftedData - Complex double of size [tSize*TXSize,xSize]
  • Data - Complex double of size [tSize*TXSize,xSize]
  • indices - integer of size [tSize,xSize]

I want to map every sub-block of Data into shiftedData according to the linear indices defined in the matrix indices. In Matlab, I might do the following:

shiftedData = zeros(tSize*TXSize,xSize);
for nT = 0:TXSize-1
    shiftedData((nT*tSize+1):((nT+1)*tSize),:) = ...
        extractData(Data((nT*tSize+1):((nT+1)*tSize),:), ...
                    indices+1,xSize,tSize);
end

function [shiftedData] = extractData(Data,indices,xSize,tSize)

    shiftedData(indices(:)) = Data(:);

end

I am trying to implement something similar in Eigen. I've written out four different approaches for the analogous extractData function, shown in pseudocode below along with the loop that calls them:

Eigen::MatrixXcd shiftedData = Eigen::MatrixXcd::Zero(tSize*TXSize,xSize);
for (int nT = 0; nT < TXSize; nT++){
    extractData(shiftedData(Eigen::seq(nT*tSize,(nT+1)*tSize-1),Eigen::all),
                Data(Eigen::seq(nT*tSize,(nT+1)*tSize-1),Eigen::all),
                indices, xSize, tSize);
}


void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
            const Eigen::Ref<const Eigen::MatrixXcd>& Data,
            const Eigen::Ref<const Eigen::MatrixXi>& indices,
            const long int& xSize, const long int& tSize) {

    // (The four approaches are shown together for comparison; only one would
    // be active at a time, so the repeated Map declarations in approaches 3
    // and 4 would not actually compile side by side.)

    // Approach 1: nested loops, splitting each linear index into row/column
    for (int nt = 0; nt < tSize; nt++){
        for (int nx = 0; nx < xSize; nx++){
            int row = indices(nt,nx) % tSize;
            int col = indices(nt,nx) / tSize;
            shiftedData(row,col) = Data(nt,nx);
        }
    }

    // Approach 2: one flat loop, recovering (nt,nx) from the loop counter
    for (int i = 0; i < tSize*xSize; i++) {
        int nt = i % tSize;
        int nx = i / tSize;
        int row = indices(nt,nx) % tSize;
        int col = indices(nt,nx) / tSize;
        shiftedData(row,col) = Data(nt,nx);
    }

    // Approach 3: linear views of the inputs via Map
    Eigen::Map<const Eigen::VectorXcd> linData(Data.data(), xSize*tSize);
    Eigen::Map<const Eigen::VectorXi> linIDX(indices.data(), xSize*tSize);
    for (int i = 0; i < xSize*tSize; i++){
        int col = linIDX(i) / tSize;
        int row = linIDX(i) % tSize;
        shiftedData(row,col) = linData(i);
    }

    // Approach 4: fully vectorized scatter through linear views
    Eigen::Map<const Eigen::VectorXcd> linData(Data.data(), xSize*tSize);
    Eigen::Map<const Eigen::VectorXi> linIDX(indices.data(), xSize*tSize);
    Eigen::Map<Eigen::VectorXcd> linView(shiftedData.data(), xSize*tSize);
    linView(linIDX) = linData;

}

Approaches 1 and 2 generate the correct output. However, approaches 3 and 4 do not. I believe the problem has something to do with the fact that I am passing non-contiguous blocks of shiftedData and Data to extractData. How can I properly account for this and perform something more analogous to the Matlab example in Eigen?


Solution

  • I'll focus on improving your first approach, as it seems the most straightforward. As you suspected, Map cannot work here: the Ref you pass refers to a non-contiguous block, so a linear Map over its .data() pointer runs into memory that belongs to other columns of the full matrix. You can use reshaped() instead, as sketched below, but that just hides the division and modulo inside the Eigen method. Let's try some different methods.
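
    A minimal sketch of the reshaped() route for reference, assuming Eigen 3.4; flatIn and flatOut are placeholder names. It handles the strided blocks correctly, but per element it performs the same division and modulo as your approach 1, so there is nothing to gain.

        // Inside extractData: linear, column-major views over the strided Ref
        // blocks. Correct, but the index arithmetic just moves into the
        // Reshaped expression.
        auto flatIn  = Data.reshaped();
        auto flatOut = shiftedData.reshaped();
        for (Eigen::Index i = 0; i < flatIn.size(); ++i)
            flatOut(indices.reshaped()(i)) = flatIn(i);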

    First, some cleanup. I'm going to ignore the outer loop over TXSize entirely and focus on the extractData function. There are a few minor issues with the interface:

    1. You pass the two dimensions as const long int&, which is pointless: a reference to a plain integer only risks losing performance (the value may be redundantly reloaded because of potential aliasing). Just pass them by value.
    2. long int itself is not the best type for this. From your linked CodeReview I know that you are on Windows, where long int is the same size as int. Prefer a type that has the correct size on all platforms. When you use Eigen, use Eigen::Index, which is std::ptrdiff_t by default and is probably what you expected to get when you typed long int.
    3. Personally, I would not pass those two arguments at all. The function can simply call the rows() and cols() methods on indices or Data. Dropping the explicit arguments means two fewer potential sources of error; a cleaned-up signature is sketched after this list.
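
    For illustration, this is what the interface might look like after all three points; the loop body stays the same, only the parameters change. Dropping the sizes follows point 3, though the later snippets keep them for comparability.

        // Sketch of the cleaned-up interface: the sizes are recovered from
        // the inputs instead of being passed alongside them.
        void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                    const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                    const Eigen::Ref<const Eigen::MatrixXi>& indices) {
            const Eigen::Index tSize = Data.rows();
            const Eigen::Index xSize = Data.cols();
            // ... loop body as in approach 1 ...
        }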

    Back to the main course. For testing, I use the dimensions from your CodeReview page: tSize = 2688; xSize = 192.
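
    For reference, a minimal benchmark setup for those dimensions could look like the following; the random permutation of linear indices is my assumption, chosen so that every output element is written exactly once.

        #include <Eigen/Dense>
        #include <algorithm>
        #include <numeric>
        #include <random>

        int main() {
            const Eigen::Index tSize = 2688, xSize = 192;
            Eigen::MatrixXcd Data = Eigen::MatrixXcd::Random(tSize, xSize);
            Eigen::MatrixXcd shiftedData = Eigen::MatrixXcd::Zero(tSize, xSize);

            // Fill `indices` with 0..tSize*xSize-1 and shuffle: a random
            // permutation, so all entries are unique as assumed below.
            Eigen::MatrixXi indices(tSize, xSize);
            std::iota(indices.data(), indices.data() + indices.size(), 0);
            std::mt19937 rng(42);
            std::shuffle(indices.data(), indices.data() + indices.size(), rng);

            // extractData(shiftedData, Data, indices, xSize, tSize);
        }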

    Iteration order

    For approach 1, your loop is

        for (int nt = 0; nt < tSize; nt++){
            for (int nx = 0; nx < xSize; nx++){
                ...
                ... = indices(nt,nx) ...;
                ... = Data(nt,nx);
            }
        }
    

    Notice how you iterate over columns in the inner loop and rows in the outer loop. Eigen stores matrices in column-major order by default, which makes this order sub-optimal: consecutive inner iterations access elements a full column apart, giving poor data locality.

    Assuming that all entries in indices are unique (so every output element is written exactly once and the write order does not matter), we can swap the order of the loop headers. On my system this gives me a 50% improvement.

    void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                const Eigen::Ref<const Eigen::MatrixXi>& indices,
                Eigen::Index xSize, Eigen::Index tSize) {
    
        for (Eigen::Index nx = 0; nx < xSize; nx++){
            for (Eigen::Index nt = 0; nt < tSize; nt++){
                Eigen::Index row = indices(nt,nx) % tSize;
                Eigen::Index col = indices(nt,nx) / tSize;
                shiftedData(row,col) = Data(nt,nx);
            }
        }
    }
    

    Avoiding redundant divisions

    You reuse the same indices for multiple matrices, which means the same integer divisions are done repeatedly. Integer division is very slow, with a throughput of 6-14 cycles per instruction on many CPUs, so it may be worth caching the split row and column indices.

    void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                const Eigen::Ref<const Eigen::Array2Xi>& indices,
                Eigen::Index xSize, Eigen::Index tSize) {
    
        for (Eigen::Index nx = 0; nx < xSize; nx++){
            for (Eigen::Index nt = 0; nt < tSize; nt++){
                Eigen::Index flatIdx = nx * tSize + nt;
                const auto outIndex = indices.col(flatIdx);
                shiftedData(outIndex.x(), outIndex.y()) = Data(nt, nx);
            }
        }
    }
    
    ...
        Eigen::Array2Xi splitIndices(2, xSize * tSize);
        for(Eigen::Index i = 0; i < xSize; ++i) {
            for(Eigen::Index j = 0; j < tSize; ++j) {
                auto out = splitIndices.col(i * tSize + j);
                const int flatIdx = indices(j, i);
                out.x() = flatIdx % tSize;
                out.y() = flatIdx / tSize;
            }
        }
        for(int i = 0; i < TXSize; ++i)
        extractData(shiftedData, Data, splitIndices, xSize, tSize);
    

    However, on my system I don't see a benefit. I assume the cost of the random memory accesses to shiftedData dominates.

    Prefer random loads over stores

    In general, CPUs are better at handling random memory loads than random stores. We can invert the indices and combine this with the cached index computation above.

    void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData,
                const Eigen::Ref<const Eigen::MatrixXcd>& Data,
                const Eigen::Ref<const Eigen::Array2Xi>& indices,
                Eigen::Index xSize, Eigen::Index tSize) {
    
        for (Eigen::Index nx = 0; nx < xSize; nx++){
            for (Eigen::Index nt = 0; nt < tSize; nt++){
                Eigen::Index flatIdx = nx * tSize + nt;
                const auto inIndex = indices.col(flatIdx);
                shiftedData(nt, nx) = Data(inIndex.x(), inIndex.y());
            }
        }
    }
    
    ...
        Eigen::Array2Xi inverseIndices(2, xSize * tSize);
        for(Eigen::Index i = 0; i < xSize; ++i) {
            for(Eigen::Index j = 0; j < tSize; ++j) {
                int flatIdx = indices(j, i);
                Eigen::Index row = flatIdx % tSize;
                Eigen::Index col = flatIdx / tSize;
                auto out = inverseIndices.col(col * tSize + row);
                out.x() = j;
                out.y() = i;
            }
        }
        for(int i = 0; i < TXSize; ++i)
        extractData(shiftedData, Data, inverseIndices, xSize, tSize);
    

    On my particular platform (Tiger Lake) I see only a minimal but positive effect. However, when I reduce the matrix to a size that fits comfortably into L2 cache (128 x 192), performance improves by 18% over the cached indices above, and the cached indices in turn hold a 7% advantage over not caching the division.

    Possible alternatives

    If the index pattern works out for it, Eigen implements multiplication with a PermutationMatrix or Transpositions, but these can only shuffle full rows or columns at once; a minimal sketch of that special case follows.
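
    For illustration, a sketch under the assumption that every column of indices encodes the same row shuffle; the cyclic shift below is a stand-in for whatever the real pattern is.

        #include <Eigen/Dense>

        // Sketch only: when all columns share one row permutation, the whole
        // gather collapses into a single permutation product on full rows.
        Eigen::MatrixXcd permuteRows(const Eigen::MatrixXcd& Data,
                                     Eigen::Index shift) {
            const Eigen::Index tSize = Data.rows();
            Eigen::PermutationMatrix<Eigen::Dynamic> P(tSize);
            for (Eigen::Index i = 0; i < tSize; ++i)
                P.indices()[i] = static_cast<int>((i + shift) % tSize); // row i of Data lands here
            return P * Data; // P * M shuffles rows; M * P would shuffle columns
        }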