I need to find all IDs that are not in my table, between the min and max ID. Unfortunately MySQL has no simple way of generating sequences, so I thought it will be easier to do in the application.
IDs in the table are from an external source, and probably around 80% of the min-max range are empty spaces (around 40mln rows in 200mln range). As a primary key, they are already fetched sorted.
The problem is simply C = A\B
, but the set A
is somewhat special so I thought of a few ways to implement that:
Generate a std::set
A
of all integers between min and max ID and then remove from it the values in set/vector B
. However I'm not sure if std::set
will work well at this memory size and with this many removals...
Populate a std::vector
with std::set_difference
of vector A
and vector B
. However, this will almost double the memory requirements (because of B
being several times smaller in cardinality than A
).
As previous, but replace the vector A
with std::ranges::views::iota
?
Some custom algorithm, iterating on a range A
, checking if not exists in B
and pushing to C
.
Any other ideas?
What is the best choice (and data structure) for this problem? Ideally minimizing both memory and runtime, or at least with small tradeoffs.
Should I pre-allocate (reserve
) in any of these solutions?
You can use the following function -
#include <vector>
#include <set>
std::vector<int> findMissingRowIDs(int minID, int maxID, const std::set<int>& B /* B is the set of RowIDs present in the table*/)
{
std::vector<int> C; // Missing RowIDs
std::set<int>::iterator it = B.begin();
while(*it<minID) it++;
int i=minID; // assuming min is inclusive
while(i<=maxID) // assuming max is inclusive
{
if (it==B.end()) // for i > (last element of B)
{
C.push_back(i);
i++;
}
else if (i<*it) // checks if i is not in B
{
C.push_back(i);
i++;
}
else
{
it++;
i++;
}
}
return C;
}
The time complexity of the above function is O(maxID-minID) and space complexity is O(maxID-minID).
You can also do it in MySQL, like this -
SELECT @minID := 1, @maxID := 15;
WITH RECURSIVE N AS (SELECT @minID AS i UNION ALL SELECT i + 1 FROM N WHERE i <= @maxID)
SELECT i FROM N WHERE i NOT IN(SELECT rowID FROM TABLE_NAME);