python, excel, duplicates, textpad

How To Identify Files with Identical Content But a Different Arrangement of the Data


I'm testing an upgrade we ran on an application that processes data. I took archived data that has already run through the system and am comparing it with the output from the newly upgraded application. The data is the same, but its arrangement in the new output is different. For example, the data on line 57 of the new file used to be on line 43 of the old output. Is there a way to detect that the files contain identical content? When I run a file compare in TextPad or compare MD5 hashes, the files are not recognized as having the same content. They are reported as different files.


Solution

  • As Enak and Dominique have mentioned, sorting both text files line by line and then comparing the sorted results will reveal with complete certainty whether anything is missing.

    If reasonable evidence is enough, you can instead calculate some aggregate values for both files and compare them, which is a lot faster. Are the numbers of words and characters the same? What about the count of each letter? Count how often each of the 26 letters occurs in both files (you could do the same for any character set of your choice); if the counts match exactly, it is very likely that both files contain the same information. This is along the same lines as your hashing approach, but it isn't as reliable: a mismatch proves that the files differ, while a match does not prove that they are the same.

    If you need to know with certainty, you will have to compare each line of file A with each line of file B in some way. If the lines are completely shuffled, sorting the lines of both files and then comparing the results is the best option. If there is locality, however (line x of file A tends to stay near position x in file B), you can skip the sort and compare the two files directly, starting the search for line x of file A around position x in file B.
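
The three ideas above (aggregate counts, sort and compare, and a search that starts near the same position) might look roughly like this in Python. This is only a sketch under some assumptions: each line is treated as one record, the files fit in memory and are UTF-8 encoded, and all function names are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def read_lines(path):
    """Return the lines of a file without their line endings (assumes UTF-8)."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def quick_fingerprint(lines):
    """Cheap aggregates: line count, word count, and per-character counts.

    A mismatch proves the files differ; a match is only suggestive.
    """
    text = "\n".join(lines)
    return len(lines), len(text.split()), Counter(text)


def same_lines_any_order(lines_a, lines_b):
    """Exact check: the same lines, with the same multiplicities, in any order."""
    return sorted(lines_a) == sorted(lines_b)


def line_differences(lines_a, lines_b):
    """Lines (with counts) that occur more often in A than in B, and vice versa."""
    count_a, count_b = Counter(lines_a), Counter(lines_b)
    return count_a - count_b, count_b - count_a


def find_near(lines_b, used, line, start):
    """Search outward from index `start` in B for an unused copy of `line`."""
    for offset in range(max(start, len(lines_b)) + 1):
        for j in (start - offset, start + offset):
            if 0 <= j < len(lines_b) and not used[j] and lines_b[j] == line:
                return j
    return None


def unmatched_lines(lines_a, lines_b):
    """Indexes of lines in A that have no unused identical line left in B.

    No sorting is done; the search for line i of A starts at position i of B,
    so it finishes quickly when lines have only moved a short distance.
    """
    used = [False] * len(lines_b)
    unmatched = []
    for i, line in enumerate(lines_a):
        j = find_near(lines_b, used, line, i)
        if j is None:
            unmatched.append(i)
        else:
            used[j] = True
    return unmatched
```

To use it, load both outputs with `read_lines("old_output.txt")` and `read_lines("new_output.txt")` (placeholder file names) and pass the resulting lists to the functions. With `unmatched_lines`, the files hold the same content when both lists have the same length and the result is empty. `line_differences` is handy when the exact check fails and you want to see which records are missing or extra.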