Search code examples
c++algorithmperformance

What's the fastest way to compare two arrays of objects and extracting the differing objects?


Assume I have a Student structure

struct Student
{
    int id;
    std::string name;
    int age;
    float dob; // date of birth in seconds, this is just for simplicity's sake
};

And I have two different arrays of Students (I open, read and parse two different text files)

Student arr1[FIRST_ARR_SIZE];
Student arr2[SECOND_ARR_SIZE];

now, my problem is that I want to compare the two arrays and every Student object that exists in one array but doesn't exist in the other array should be stored in

Student output_arr1[]; // Student objects that only exist in arr1
Student output_arr2[]; // Student objects that only exist in arr2

Things to consider:

  1. I do not care what exactly was different - all I care about is that this object is not exactly present in the other array. (Even if the difference is literally just one bit, that's enough for me to consider it unique)
  2. I don't particularly care about memory usage, just want it to run as fast as possible
  3. This is NOT simply an equality check. I want the individual objects that differ to be stored.

The only two solutions I could think of, off of the top of my head, are:

  1. the O(N*M) brute force solution (i.e., check every combination from arr1 and arr2).
for (Student s : arr1) {
    // do stuff
    for (Student s : arr2) {
        // do stuff
    }
    // do stuff
}

Obviously this is not good at all.

Or, a much better solution for large-ish (tens / hundreds of thousands) N and M (which is true in my case),

  1. Sort it over the id or something and then go over the two arrays in just one two-pointer loop and linearly compare the objects pointed to in both arrays by the two pointers for equality, which would be O(NlogN + MlogM + max(N, M))

I was wondering if there were any clever algorithms to do this even quicker than those two?


Solution

  • Linear time, linear space:

    Put everything from the smaller array in a hash.

    Parse the large array. For each element in the large array:

    • If it is in the hash, delete it since it's in both
    • If it isn't in the hash, add it to your output

    At the end, everything remaining in the hash is only in the smaller array, so add those to the output as well.