c++ set set-difference

Difference between two vector<MyType*> A and B


I've got two vector<MyType*> objects called A and B. The MyType class has a field ID, and I want to get the MyType* elements that are in A but not in B. I'm working on an image analysis application and was hoping to find a fast/optimized solution.


Solution

  • Working on unsorted data, where you search B for every element of A, will typically have quadratic complexity. If both vectors are sorted beforehand (by your ID field, which costs O(n log n)), the difference itself is linear and does not require repeated searches through B.

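    // Requires <algorithm>, <iterator> and <vector>; assumes using namespace std.
    struct CompareId
    {
        bool operator()(const MyType* a, const MyType* b) const
        {
            return a->ID < b->ID;
        }
    };
    ...
    // Both ranges must be sorted with the same ordering that set_difference uses.
    sort(A.begin(), A.end(), CompareId() );
    sort(B.begin(), B.end(), CompareId() );
    
    vector<MyType*> C;
    // Pass CompareId here too; otherwise the pointers themselves are compared.
    set_difference(A.begin(), A.end(), B.begin(), B.end(), back_inserter(C), CompareId() );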
    

    Another solution is to use an ordered container like std::set with CompareId as its comparison template argument (which must define a strict weak ordering). I think this would be better if you need to apply a lot of set operations. It has its own overhead (being a tree), but if you really find that to be an efficiency problem, you could implement a fast memory allocator to make inserting and removing elements cheaper (note: only do this if you profile and determine this to be a bottleneck).
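
    A minimal sketch of the std::set variant, assuming MyType has an integer ID (the real class is not shown in the question). MyTypeSet and difference_by_id are hypothetical names introduced only for this illustration. Note that a std::set keyed this way keeps one element per ID; std::multiset would be the choice if duplicates must be preserved.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Assumption: MyType exposes an integer ID; the real class is not shown.
struct MyType
{
    int ID;
};

struct CompareId
{
    bool operator()(const MyType* a, const MyType* b) const
    {
        return a->ID < b->ID;
    }
};

// Ordered by ID; two pointers with equal IDs count as the same element.
typedef std::set<MyType*, CompareId> MyTypeSet;

// Hypothetical helper: returns the elements of a whose ID does not occur in b.
std::vector<MyType*> difference_by_id(const MyTypeSet& a, const MyTypeSet& b)
{
    std::vector<MyType*> result;
    // Both sets already iterate in ID order, so no explicit sort is needed.
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(result), CompareId());
    return result;
}
```

    The sorting cost is paid incrementally on every insert, which is why this tends to pay off only when the same sets are reused for many operations.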

    Warning: getting into somewhat complicated territory.

    There is another solution you can consider which could be very fast if applicable, and with it you never have to worry about sorting data. Basically, make every group of MyType objects which share the same ID store a shared counter (e.g. a pointer to an unsigned int).

    This will require creating a map of IDs to counters and fetching the counter from the map, based on its ID, each time a MyType object is created. Since you have MyType objects with duplicate IDs, you shouldn't have to insert into the map as often as you create MyType objects (most can probably just fetch an existing counter).
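
    One way the ID-to-counter map could look, as a sketch only. CounterRegistry and counter_for are hypothetical names, the integer ID type is an assumption, and the MyType shown here is a stand-in for the real class. std::map is used because inserting into it does not invalidate pointers to existing elements, so the stored counter pointers stay valid as long as the registry is alive.

```cpp
#include <map>

// Hypothetical registry that owns one counter per ID.
class CounterRegistry
{
public:
    // Returns the counter shared by every object with this ID,
    // creating it on first use.
    unsigned int* counter_for(int id)
    {
        // operator[] value-initializes a new counter to 0.
        return &counters_[id];
    }

private:
    std::map<int, unsigned int> counters_;
};

// Assumption: stand-in for the real MyType, which is not shown.
struct MyType
{
    MyType(int id, CounterRegistry& registry)
        : ID(id), counter(registry.counter_for(id))
    {
    }

    int ID;
    unsigned int* counter; // shared with all other objects that have the same ID
};
```

    Two objects constructed with the same ID and the same registry end up pointing at the same counter, so marking one marks them all.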

    In addition to this, have a global 'traversal' counter which gets incremented whenever it's fetched.

    static unsigned int counter = 0;
    unsigned int traversal_counter()
    {
        // Make this atomic for multithreaded applications. It also
        // needs to be modified to reset all existing ID-associated
        // counters to 0 on overflow (see below).
        return ++counter;
    }
    

    Now let's go back to where you have A and B vectors storing MyType*. To fetch the elements in A that are not in B, we first call traversal_counter(). Assuming it's the first time we call it, that will give us a traversal value of 1.

    Now iterate through every MyType* object in B and set the shared counter for each object from 0 to the traversal value, 1.

    Now iterate through every MyType* object in A. The ones whose counter value doesn't match the current traversal value (1) are the elements in A that are not contained in B.
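
    The two passes described above could be sketched as follows. This assumes each MyType already holds a pointer to the counter shared by its ID; the struct layout and the function name difference are assumptions for illustration, and overflow of the traversal counter is deliberately ignored here.

```cpp
#include <cstddef>
#include <vector>

// Assumption: stand-in for the real MyType; counter points at the
// unsigned int shared by all objects with the same ID.
struct MyType
{
    int ID;
    unsigned int* counter;
};

static unsigned int counter = 0;

// Simplified version: not atomic and without overflow handling.
unsigned int traversal_counter()
{
    return ++counter;
}

// Hypothetical helper: returns the elements of A whose ID does not occur in B.
std::vector<MyType*> difference(const std::vector<MyType*>& A,
                                const std::vector<MyType*>& B)
{
    const unsigned int traversal = traversal_counter();

    // Pass 1: stamp every ID that occurs in B with the current traversal value.
    for (std::size_t i = 0; i < B.size(); ++i)
        *B[i]->counter = traversal;

    // Pass 2: anything in A without the current stamp is not in B.
    std::vector<MyType*> C;
    for (std::size_t i = 0; i < A.size(); ++i)
        if (*A[i]->counter != traversal)
            C.push_back(A[i]);
    return C;
}
```

    Because every call takes a fresh traversal value, stamps left over from earlier calls never match and nothing has to be cleared between operations.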

    What happens when you overflow the traversal counter? In this case, we iterate through all the counters stored in the ID map and set them back to zero along with the traversal counter itself. This will only need to occur once in about 4 billion traversals if it's a 32-bit unsigned int.
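
    A possible shape for the overflow handling, as a sketch under assumptions: id_counters is a hypothetical stand-in for the ID map (here it stores the counters by value, keyed by an assumed integer ID), and the function is not thread-safe as written.

```cpp
#include <limits>
#include <map>

// Hypothetical stand-in for the map of IDs to shared counters.
static std::map<int, unsigned int> id_counters;
static unsigned int counter = 0;

unsigned int traversal_counter()
{
    if (counter == std::numeric_limits<unsigned int>::max())
    {
        // The next increment would wrap around. Clear every ID counter so
        // that no stale stamp can collide with a reused traversal value.
        for (std::map<int, unsigned int>::iterator it = id_counters.begin();
             it != id_counters.end(); ++it)
        {
            it->second = 0;
        }
        counter = 0;
    }
    // Never returns 0, so 0 always means "not stamped".
    return ++counter;
}
```

    The reset is linear in the number of distinct IDs, but since it runs only when the counter is exhausted its amortized cost is negligible.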

    This is likely among the fastest solutions you can apply to your given problem. It can do set operations in linear complexity on unsorted data, and that bound holds in the worst case, whereas a hash table is linear only on average. It does introduce some complexity, though, so only consider it if you really need it.