Search code examples
c++vectorcomparisonlarge-data

comparing vectors in a large data set


I have a two dimensional vector of integer type that contains a large amount of vectors ( i.e 18000 and above ), and there is a considerable amount amount of repetitive vectors in this pool. What I want to do is to detect the similar vectors and delete one of them. What I am currently doing is that I am comparing each vector against the entire pool using the following function :`

bool compareVectors(vector<int> a, vector<int> b)
{
    if (a.size() != b.size())
    {
        return false;
    }
    sort(a.begin(), a.end());
    sort(b.begin(), b.end());
    return (a == b);
}

But this doesn't efficiently do the process presumably due to the large amounts of comparisons I am making. Is there any possible efficient ways to do this ?


Solution

  • Preparations:

    1. First sort your vectors into buckets according to size.
    2. Buckets with only one vector means the vector is unique, output and remove.
    3. sort all remaining vectors in the remaining buckets

    start with i = 0

    Recursive Algorithm:

    for each bucket:

    1. sort vectors into buckets according to v.(i)
    2. Buckets with only one vector means the vector is unique, output and remove
    3. recurse into each bucket with i=i+1