I need a way to sort and delete the duplicates of row and column indices of a sparse matrix. But at first I need to sort indices and later find a method to delete the duplicates.
Eigen::Matrix<std::pair<int, int>, Eigen::Dynamic, Eigen::Dynamic> unsorted_indices;
How can write a sort method for this? or method that I can achieve the both at once.
I can easily achieve this with
std::set<pair<int,int>>
[…] When unknows grow over a million then things are very costly
Matrix is the wrong data structure since you apparently make no use of it being two-dimensional. If std::set
works for you, you can use a plain old std::vector
#include <algorithm>
std::vector<std::pair<int, int>> indices;
for(...)
indices.emplace_back(row_i, col_j);
std::sort(indices.begin(), indices.end());
indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
indices.shrink_to_fit();
vector
is more memory efficient than std::set
(for the cost of a single entry in std::set
including allocator overhead you can put 6 entries into a vector
). The erase(unique(…)
stuff is known as the erase-remove idiom
If the number of duplicates is large, removing them early with a set-like structure may be better. However, std::set
has a relatively high memory overhead and costly O(log(n)) insertion cost. std::unordered_set
is more efficient with O(1) and has slightly lower memory usage but it requires you to write your own hash function.
Unless you need to avoid external libraries, use a flat hash table. They have a much lower memory overhead. Boost comes with one that supports std::pair
out-of-the-box:
#include <boost/unordered/unordered_flat_set.hpp>
boost::unordered_flat_set<std::pair<int, int>> set;
for(...)
set.insert({row_i, col_j});
std::vector<std::pair<int, int>> indices(set.begin(), set.end());
std::sort(indices.begin(), indices.end());
As an alternative, you can use a simple bitmap. The memory consumption is then fixed to rows * cols / 8
byte as opposed to the at least 8 byte per entry in the hash table (not accounting for the hash table's overhead). So if your matrix is filled at least 1/64 (1.5%), the bitmap is more memory-efficient. Scanning it for set bits also automatically produces a sorted sequence, saving us this step.
#include <bit>
// C++20, using std::countr_zero
#include <cstdint>
// using std::uint32_t
std::size_t words = (rows * cols + 31) / 32;
std::vector<std::uint32_t> bitmap(words);
for(...) {
std::size_t index = row_i * cols + col_j;
bitmap[index / 32] |= 1 << (index % 32);
}
std::vector<std::pair<int, int>> indices;
for(std::size_t i = 0; i < words; ++i) {
std::uint32_t word = bitmap[i];
while(word) {
std::uint32_t bit = std::countr_zero(word);
std::size_t index = i * 32 + bit;
int row_i = index / cols;
int col_j = index % cols;
indices.emplace_back(row_i, col_j);
word &= ~(1 << bit); // clear bit
}
}