Search code examples
algorithmtime-complexityn-way-merge

Comparing five different sources


I need to write a function which will compare 2-5 "files" (well really 2-5 sets of database rows, but similar concept), and I have no clue of how to do it. The resulting diff should present the 2-5 files side by side. The output should show added, removed, changed and unchanged rows, with a column for each file.

What algorithm should I use to traverse rows so as to keep complexity low? The number of rows per file is less than 10,000. I probably won't need External Merge as total data size is in the megabyte range. Simple and readable code would of course also be nice, but it's not a must.

Edit: the files may be derived from some unknown source, there is no "original" to which the other 1-4 files can be compared to; all files will have to be compared to the others in their own right somehow.

Edit 2: I, or rather my colleague, realized that the contents may be sorted, as the output order is irrelevant. This solution means using additional domain knowledge to this part of the application, but also that diff complexity is O(N) and less complicated code. This solution is simple and I'll disregards any answers to this edit when I close the bounty. However I'll answer my own question for future reference.


Solution

  • If all of the n files (where 2 <= n <= 5 for the example) have to be compared to the others, then it seems to me that the number of combinations to compare will be C(n,2), defined by (in Python, for instance) as:

    def C(n,k): 
        return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
    

    Thus, you would have 1, 3, 6 or 10 pairwise comparisons for n = 2, 3, 4, 5 respectively.

    The time complexity would then be C(n,2) times the complexity of the pairwise diff algorithm that you chose to use, which would be an expected O(ND), in the case of Myers' algorithm, where N is the sum of the lengths of the two sequences to be compared, A and B, and D is the size of the minimum edit script for A and B.

    I'm not sure about the environment in which you need this code but difflib in Python, as an example, can be used to find the differences between all sorts of sequences - not just text lines - so it might be useful to you. The difflib documentation doesn't say exactly what algorithm it uses, but its discussion of its time complexity makes me think that it is similar to Myers'.