Tags: python, algorithm, duplicates, python-dedupe

PDF File dedupe issue with same content, but generated at different time periods from a docx


I am working on a PDF file dedupe project and have analyzed many Python libraries that read files, generate a hash value for each, and compare hashes to detect duplicates - similar to the logic below, or using Python's filecmp module. The issue I found with this approach is that if the same source DOCX is exported to PDF ("Save as PDF") more than once, the resulting files are not considered duplicates - even though the content is exactly the same. Why does this happen? Is there other logic that reads the actual content and creates a unique hash based on it?

import hashlib

def calculate_hash_val(path, blocks=65536):
    """Hash a file's raw bytes, reading in fixed-size chunks."""
    hasher = hashlib.md5()
    with open(path, 'rb') as file:
        data = file.read(blocks)
        while len(data) > 0:
            hasher.update(data)
            data = file.read(blocks)
    return hasher.hexdigest()
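To make the failure mode concrete, here is a minimal sketch of why byte-level hashing reports two "identical" documents as different. The two byte strings below are hypothetical stand-ins for PDF files whose visible content matches but whose embedded creation timestamps differ:

```python
import hashlib

# Simplified stand-ins for two PDF exports of the same DOCX:
# identical visible content, different embedded creation timestamps.
doc_a = b"%content: Hello World%\n%CreationDate: D:20230101120000%"
doc_b = b"%content: Hello World%\n%CreationDate: D:20240615120000%"

hash_a = hashlib.md5(doc_a).hexdigest()
hash_b = hashlib.md5(doc_b).hexdigest()

# The single differing metadata field changes the whole digest.
print(hash_a == hash_b)  # False
```

This is exactly what happens with real PDFs: the page content streams can be byte-identical while the metadata dictionary is not, so any whole-file hash diverges.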

Solution

  • One of the things that happens is that the generator saves metadata to the file, including the creation time. It is invisible when viewing the PDF, but it makes the byte-level hash different.

    Here is an explanation of how to find and strip out that metadata with at least one tool. I am sure there are many others.