I am trying to figure out how to check whether the MD5 hash of a URL string already exists in an index file and, if it does, skip the scan.
My code is below.
Each URL is converted to an MD5 string and stored in an .idx file once its scan completes; the goal is that future scans should not pick up the same URL. The problem I see is that the check if str(md5url) in line
does not seem to be executed, probably because I am not using '\n' as a suffix when adding the hash to the file. But I tried that and it is still not working.
Any ideas?
def computeMD5hash(string_for_hash):
    m = hashlib.md5()
    m.update(string_for_hash.encode('utf-8'))
    return m.hexdigest()

def writefilehash(formation_URL):
    fn="urlindex.idx"
    try:
        afile = open(fn, 'a')
        afile.write(computeMD5hash(formation_URL))
        afile.close()
    except IOError:
        print("Error writing to the index file")

fn="urlindex.idx"
try:
    afile = open(fn, 'r')
except IOError:
    afile = open(fn, 'w')
for f in files:
    formation=repouri + "/" + f
    #print(computeMD5hash(formation))
    md5url=computeMD5hash(formation)
    hashlist = afile.readlines()
    for line in hashlist:
        if str(md5url) in line:
            print ("Skipping " + formation + " because its already scanned and indexed as " + line)
        else:
            if downloadengine(formation):
                print ("Download completed " + formation)
                print ("Starting to write to database..")
                #writetodatabase()
                print ("Writing hash value ..")
                writefilehash(formation)
print("Closing..")
afile.close()

You are testing in a loop. For every line that doesn't match, you download:
line1
    if hash in line:
        print something
    else
        download
line2
    if hash in line:
        print something
    else
        download
line3
    if hash in line:
        print something
    else
        download
If the hash is in line 1, you still download, because the hash is not in line 2 or line 3. You should not decide to download until you have tested all lines.
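To make the difference concrete, here is a small sketch that counts how many downloads the per-line decision would trigger, next to a check that decides only once after looking at every line. The function names and the sample lines are made up for illustration; the counter stands in for the calls to downloadengine().

```python
def downloads_per_line(target, lines):
    """Mimic the loop in the question: one download decision per line read."""
    downloads = 0
    for line in lines:
        if target in line:
            pass  # the question's code prints "Skipping ..." here
        else:
            downloads += 1  # the question's code calls downloadengine() here
    return downloads


def already_indexed(target, lines):
    """Decide once, after comparing against every line."""
    return any(target == line.strip() for line in lines)


# Hypothetical index contents; real entries would be MD5 hex digests.
lines = ["aaa\n", "bbb\n", "ccc\n"]

print(downloads_per_line("aaa", lines))  # 2: one per non-matching line
print(already_indexed("aaa", lines))     # True: found, so nothing to download
```

Note also that with an empty list the first function returns 0, so under the per-line pattern an empty index file would trigger no downloads at all.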
The best way to do this is to read all the hashes in one go, into a set object (because testing for containment against a set is faster). Remove the line separators:
try:
    with open(fn) as hashfile:
        hashes = {line.strip() for line in hashfile}
except IOError:
    # no file yet, just use an empty set
    hashes = set()
then when testing new hashes use:
urlhash = computeMD5hash(formation)
if urlhash not in hashes:
    # not seen before, download
    # record the hash
    hashes.add(urlhash)
    with open(fn, 'a') as hashfile:
        hashfile.write(urlhash + '\n')
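
Putting the two pieces together, a minimal end-to-end sketch might look like the following. It is one possible arrangement, not a drop-in replacement: load_hashes, scan, and fake_download are hypothetical names, fake_download stands in for the question's downloadengine(), the example.com URLs are placeholders, and the index is written to a temporary directory so the example can run anywhere.

```python
import hashlib
import os
import tempfile


def computeMD5hash(string_for_hash):
    return hashlib.md5(string_for_hash.encode('utf-8')).hexdigest()


def load_hashes(fn):
    """Read every stored hash into a set; a missing file means nothing indexed yet."""
    try:
        with open(fn) as hashfile:
            return {line.strip() for line in hashfile if line.strip()}
    except IOError:
        return set()


def scan(urls, fn, download):
    """Download each URL whose hash is not in the index; return the URLs fetched."""
    hashes = load_hashes(fn)
    downloaded = []
    for url in urls:
        urlhash = computeMD5hash(url)
        if urlhash in hashes:
            continue  # already scanned and indexed
        if download(url):
            # record the hash in memory and on disk, one hash per line
            hashes.add(urlhash)
            with open(fn, 'a') as hashfile:
                hashfile.write(urlhash + '\n')
            downloaded.append(url)
    return downloaded


def fake_download(url):
    # Assumption: stands in for downloadengine() and always succeeds.
    return True


index_path = os.path.join(tempfile.mkdtemp(), "urlindex.idx")
urls = ["http://example.com/repo/a", "http://example.com/repo/b"]

first_run = scan(urls, index_path, fake_download)
second_run = scan(urls, index_path, fake_download)
print(first_run)   # both URLs are fetched on the first pass
print(second_run)  # nothing is fetched on the second pass
```

Recording the hash only after a successful download means a failed download is retried on the next scan rather than being skipped.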