Search code examples
c++stdvectorstd-rangesstdsetset-difference

Find integers in range which are not in a set


I need to find all IDs that are not in my table, between the min and max ID. Unfortunately MySQL has no simple way of generating sequences, so I thought it will be easier to do in the application.

IDs in the table are from an external source, and probably around 80% of the min-max range are empty spaces (around 40mln rows in 200mln range). As a primary key, they are already fetched sorted.

The problem is simply C = A\B, but the set A is somewhat special so I thought of a few ways to implement that:

Generate a std::set A of all integers between min and max ID and then remove from it the values in set/vector B. However I'm not sure if std::set will work well at this memory size and with this many removals...

Populate a std::vector with std::set_difference of vector A and vector B. However, this will almost double the memory requirements (because of B being several times smaller in cardinality than A).

As previous, but replace the vector A with std::ranges::views::iota?

Some custom algorithm, iterating on a range A, checking if not exists in B and pushing to C.

Any other ideas? What is the best choice (and data structure) for this problem? Ideally minimizing both memory and runtime, or at least with small tradeoffs. Should I pre-allocate (reserve) in any of these solutions?


Solution

  • C++ solution:

    You can use the following function -

    #include <vector>
    #include <set>
    
    std::vector<int> findMissingRowIDs(int minID, int maxID, const std::set<int>& B /* B is the set of RowIDs present in the table*/)
    {
        std::vector<int> C; // Missing RowIDs
        std::set<int>::iterator it = B.begin();
        
        while(*it<minID) it++;
        
        int i=minID; // assuming min is inclusive
        
        while(i<=maxID) // assuming max is inclusive
        {
            if (it==B.end()) // for i > (last element of B)
            {
                C.push_back(i);
                i++;
            }
            else if (i<*it) // checks if i is not in B
            {
                C.push_back(i);
                i++;
            }
            else
            {
                it++;
                i++;
            }
        }
        
        return C;
    }
    

    The time complexity of the above function is O(maxID-minID) and space complexity is O(maxID-minID).

    MySQL solution:

    You can also do it in MySQL, like this -

    SELECT @minID := 1, @maxID := 15;
    
    WITH RECURSIVE N AS (SELECT @minID AS i UNION ALL SELECT i + 1 FROM N WHERE i <= @maxID)
      SELECT i FROM N WHERE i NOT IN(SELECT rowID FROM TABLE_NAME);