Tags: java, python, bash, scripting, file-comparison

What is the best way to compare two large files based on the first token in each line?


I have two large files (each about 500k lines, or 85 MB), where each line contains a file's checksum followed by its path. What is the best way to get the differences between the two files based on the checksum? I can write a Java program, a script, etc., but the goal is for it to be efficient.

For example, I have FileA:

ec7a063d3990cf7d8481952ffb45f1d8b490b1b5  /home/user/first.txt
e0f886f2124804b87a81defdc38ad2b492458f34  /home/user/second.txt

FileB:

650bc1eb1b24604819eb342f2ebc1bab464d9210  /home/user/third.txt
ec7a063d3990cf7d8481952ffb45f1d8b490b1b5  /home/user/blah/dup.txt

I want to output two files containing the entries that are unique to FileA and FileB, respectively.

UniqueA

e0f886f2124804b87a81defdc38ad2b492458f34  /home/user/second.txt

UniqueB

650bc1eb1b24604819eb342f2ebc1bab464d9210  /home/user/third.txt

In this case, "first.txt" and "dup.txt" are the same since their checksums match, so I exclude them as not being unique. What is the most efficient way to do this? The files aren't sorted in any way.


Solution

  • So here's a quick answer, but it's not so efficient:

    $ join -v1 <(sort FileA) <(sort FileB) | tee UniqueA
    e0f886f2124804b87a81defdc38ad2b492458f34 /home/user/second.txt
    
    $ join -v2 <(sort FileA) <(sort FileB) | tee UniqueB
    650bc1eb1b24604819eb342f2ebc1bab464d9210 /home/user/third.txt
    

    The join command matches lines from two sorted files by key (which by default is the first field, with whitespace as the delimiter). The commands above are not so efficient, though, because we are sorting the files twice: once to get the values unique to the first file (-v1) and then again to get the values unique to the second (-v2). I'll post some improvements shortly.
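
    As a minimal sketch of one such improvement (not part of the original commands; the sorted.A and sorted.B filenames are just placeholders), you could sort each file once into a temporary copy and reuse it for both passes:

    $ sort FileA > sorted.A
    $ sort FileB > sorted.B
    $ join -v1 sorted.A sorted.B > UniqueA    # lines whose checksum appears only in FileA
    $ join -v2 sorted.A sorted.B > UniqueB    # lines whose checksum appears only in FileB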

    You can get the values that are unique to each file in a single invocation, but you lose track of which file each line came from. See the code below:

    $ join -v1 -v2 <(sort FileA) <(sort FileB)
    650bc1eb1b24604819eb342f2ebc1bab464d9210 /home/user/third.txt
    e0f886f2124804b87a81defdc38ad2b492458f34 /home/user/second.txt
    

    At this point, we almost have our answer. We have all of the unmatched lines from both files. Moreover, we've only sorted each file once. I believe this is efficient. However, you have lost the "origin" information. We can tag the rows with sed, as in this iteration of the code:

    $ join -v1 -v2 <(sort FileA | sed s/$/\ A/ ) <(sort FileB | sed s/$/\ B/ )
    650bc1eb1b24604819eb342f2ebc1bab464d9210 /home/user/third.txt B
    e0f886f2124804b87a81defdc38ad2b492458f34 /home/user/second.txt A
    

    At this point, we have our unique entries and we know which file they came from. If you must have the results in separate files, I imagine you can accomplish this with awk (or just more bash). Here's one more iteration of the code with awk included:

    $ join -v1 -v2 <(sort FileA | sed s/$/\ A/ ) <(sort FileB | sed s/$/\ B/ ) | awk '{ file="Unique" $3 ; print $1,$2 > file }'
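
    The awk step splits the joined output into the two files in a single pass, using the A/B tag to build the output filename. With the sample FileA and FileB from the question, this should leave:

    $ cat UniqueA
    e0f886f2124804b87a81defdc38ad2b492458f34 /home/user/second.txt
    $ cat UniqueB
    650bc1eb1b24604819eb342f2ebc1bab464d9210 /home/user/third.txt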