Search code examples
sortingmatrixeigen

How to sort the Eigen::Matrix that have std::pair<int, int> and delete the duplicates of indices


I need a way to sort and delete the duplicates of row and column indices of a sparse matrix. But at first I need to sort indices and later find a method to delete the duplicates.

Eigen::Matrix<std::pair<int, int>, Eigen::Dynamic, Eigen::Dynamic> unsorted_indices; 

How can write a sort method for this? or method that I can achieve the both at once.


Solution

  • I can easily achieve this with std::set<pair<int,int>> […] When unknows grow over a million then things are very costly

    Matrix is the wrong data structure since you apparently make no use of it being two-dimensional. If std::set works for you, you can use a plain old std::vector

    #include <algorithm>
    
    std::vector<std::pair<int, int>> indices;
    for(...)
      indices.emplace_back(row_i, col_j);
    std::sort(indices.begin(), indices.end());
    indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
    indices.shrink_to_fit();
    

    vector is more memory efficient than std::set (for the cost of a single entry in std::set including allocator overhead you can put 6 entries into a vector). The erase(unique(…) stuff is known as the erase-remove idiom

    Hash table

    If the number of duplicates is large, removing them early with a set-like structure may be better. However, std::set has a relatively high memory overhead and costly O(log(n)) insertion cost. std::unordered_set is more efficient with O(1) and has slightly lower memory usage but it requires you to write your own hash function.

    Unless you need to avoid external libraries, use a flat hash table. They have a much lower memory overhead. Boost comes with one that supports std::pair out-of-the-box:

    #include <boost/unordered/unordered_flat_set.hpp>
    
    boost::unordered_flat_set<std::pair<int, int>> set;
    for(...)
      set.insert({row_i, col_j});
    std::vector<std::pair<int, int>> indices(set.begin(), set.end());
    std::sort(indices.begin(), indices.end());
    

    Bitmap

    As an alternative, you can use a simple bitmap. The memory consumption is then fixed to rows * cols / 8 byte as opposed to the at least 8 byte per entry in the hash table (not accounting for the hash table's overhead). So if your matrix is filled at least 1/64 (1.5%), the bitmap is more memory-efficient. Scanning it for set bits also automatically produces a sorted sequence, saving us this step.

    #include <bit>
    // C++20, using std::countr_zero
    #include <cstdint>
    // using std::uint32_t
    
    std::size_t words = (rows * cols + 31) / 32;
    std::vector<std::uint32_t> bitmap(words);
    for(...) {
      std::size_t index = row_i * cols + col_j;
      bitmap[index / 32] |= 1 << (index % 32);
    }
    std::vector<std::pair<int, int>> indices;
    for(std::size_t i = 0; i < words; ++i) {
      std::uint32_t word = bitmap[i];
      while(word) {
        std::uint32_t bit = std::countr_zero(word);
        std::size_t index = i * 32 + bit;
        int row_i = index / cols;
        int col_j = index % cols;
        indices.emplace_back(row_i, col_j);
        word &= ~(1 << bit); // clear bit
      }
    }