
Check if MD5 value exists in an index file


I am trying to check whether the MD5 hash of a URL string already exists in an index file and, if it does, skip the scan.

Below is my code

Each URL is converted to an MD5 string and stored in an .idx file once its scan completes; the goal is that future scans should not pick up the same URL again. The issue I see is that the check if str(md5url) in line never succeeds, probably because I am not appending '\n' when adding the hash to the file. But I tried that and it still does not work.

Any ideas?

import hashlib


def computeMD5hash(string_for_hash):
    m = hashlib.md5()
    m.update(string_for_hash.encode('utf-8'))
    return m.hexdigest()


def writefilehash(formation_URL):
    fn="urlindex.idx"
    try:
        afile = open(fn, 'a')
        afile.write(computeMD5hash(formation_URL))
        afile.close()
    except IOError:
        print("Error writing to the index file")

fn="urlindex.idx"
try:
    afile = open(fn, 'r')
except IOError:
    afile = open(fn, 'w')

for f in files:
    formation=repouri + "/" + f
    #print(computeMD5hash(formation))
    md5url=computeMD5hash(formation)
    hashlist = afile.readlines()
    for line in hashlist:
        if str(md5url) in line:
            print ("Skipping " + formation + " because its already scanned and indexed as  " + line)
        else:
            if downloadengine(formation):
                print ("Download completed " + formation)
                print ("Starting to write to database..")
                #writetodatabase()
                print ("Writing hash value ..")
                writefilehash(formation)

print("Closing..")
afile.close()

Solution

  • You are testing in a loop. For every line that doesn't match, you download:

    line1
        if hash in line:
            print something
        else
            download
    line2
        if hash in line:
            print something
        else
            download
    line3
        if hash in line:
            print something
        else
            download
    

    If the hash is in line 1, you still download, because the hash is not in line 2 or line 3. You should not decide to download until you have tested all lines.

    The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:

    try:
        with open(fn) as hashfile:
            hashes = {line.strip() for line in hashfile}
    except IOError:
        # no file yet, just use an empty set
        hashes = set()
    

    Then, when testing new hashes, use:

    urlhash = computeMD5hash(formation)
    if urlhash not in hashes:
        # not seen before, download
        # record the hash
        hashes.add(urlhash)
        with open(fn, 'a') as hashfile:
            hashfile.write(urlhash + '\n')
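
Putting the two pieces together, a minimal end-to-end sketch might look like the following. It is only an illustration under a few assumptions: the helper names load_hashes and scan_once are hypothetical, and the download argument is a stand-in callable for the question's downloadengine(), which is not shown in the original post.

```python
import hashlib
import os
import tempfile


def compute_md5_hash(text):
    """Return the hex MD5 digest of a string (same logic as computeMD5hash)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def load_hashes(index_path):
    """Read previously recorded hashes into a set; empty set if no file yet."""
    try:
        with open(index_path) as hashfile:
            # strip() removes the trailing newline; blank lines are ignored
            return {line.strip() for line in hashfile if line.strip()}
    except IOError:
        return set()


def scan_once(urls, index_path, download):
    """Download each URL that is not indexed yet; return the URLs downloaded.

    `download` is a hypothetical callable standing in for downloadengine().
    It should return a truthy value when the download succeeded.
    """
    hashes = load_hashes(index_path)
    downloaded = []
    for url in urls:
        urlhash = compute_md5_hash(url)
        if urlhash in hashes:
            print("Skipping " + url + " (already indexed)")
            continue
        if download(url):
            hashes.add(urlhash)
            # One hash per line, so the next run can read it back.
            with open(index_path, "a") as hashfile:
                hashfile.write(urlhash + "\n")
            downloaded.append(url)
    return downloaded


if __name__ == "__main__":
    index_path = os.path.join(tempfile.mkdtemp(), "urlindex.idx")
    urls = ["http://example.com/repo/a", "http://example.com/repo/b"]
    print(scan_once(urls, index_path, lambda url: True))  # first run
    print(scan_once(urls, index_path, lambda url: True))  # second run
```

The decision to download is made once per URL, after a single membership test against the whole set, rather than once per line of the index file. The hash is recorded only when the download reports success, so a failed URL is retried on the next run.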