I need to find an efficient way of find out what is different between two large sorted arrays. In other words, I need to find out what was added/deleted from one of them based on comparisons with the other. The sorting is optional, so if you think we can achieve something with no ordering, that is fine with me.
These two arrays are each one million elements long, so comparing them in memory at once is not feasible.
The background for this is simple. I am trying to get all new rows from remote legacy SQL (OpenEdge) table that does not have any way of telling what is new. I know this might sound strange, but it is a reality I am working with. So no triggers on data, no timestamps, nothing. This was resolved in another StackOverflow thread so I am not looking for ways to add this functionality to the remote table.
I have a copy of this table in my Postgresql local database to help with comparisons. I am doing the comparison over the network and using jRuby with a JDBC driver to inspect remote data. Up to now I tried to load both tables into Ruby arrays and do a standard array - array
, but this eats up why too much memory (the tables are each a million rows long).
Any other options for me to consider? Any algorithms I am not aware of?
If both arrays are already sorted, then you can go through both arrays at the same time and compare the elements using an index per array. If the elements at index i and j are equals, then advance both indexes. If the elements are different, then just advance the index where the element in the array is less than the element in the other array.
Here's the code for this method. Note that it assumes both arrays are sorted and the elements can be compared through ==
, <
and >
:
def compare_sorted_arrays a1, a2
i, j = 0, 0
output = []
while i < a1.length and j < a2.length do
if a1[i] == a2[j]
i += 1
j += 1
elsif a1[i] < a2[j]
output << a1[i]
i += 1
else
output << a2[j]
j += 1
end
end
i.upto(a1.length-1) do |x|
output << a1[x]
end
j.upto(a2.length-1) do |x|
output << a2[x]
end
output
end
And a simple test
puts (compare_sorted_arrays([1, 2, 3, 4,], [1, 3, 5, 7])).join(', ')
puts (compare_sorted_arrays([1, 3, 5, 7], [1, 2, 3, 4,])).join(', ')
Output:
2, 4, 5, 7
2, 4, 5, 7
Another option may be doing the symmetric difference directly in SQL as shown here: Does the SQL spec provide for a better way to do a exclusive ORing of two sets?
SELECT COALESCE(A.id, B.id) AS id
FROM InitialTable A
FULL OUTER JOIN TableToCompareWith B
ON A.id = B.id
WHERE A.id IS NULL OR B.id IS NULL