I have a large collection of files that I'd like to recurse through, computing an MD5 checksum for each one.
The files are spread across multiple physical disks, all mounted under the same directory:
/mnt/drive1/dir1/file.jpg
/mnt/drive2/dir1/file2.jpg
How can I recurse through /mnt without loading the entire directory and file structure into memory?
Is there a way to do this with multiple threads? Walking the directories probably doesn't need multiple threads or processes, but the per-file work can be CPU-intensive and would benefit from multiple CPU cores.
Thanks in advance.
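import hashlib
import multiprocessing
import os
import sys

VALID_EXTENSIONS = ('.JPG', '.GIF', '.JPEG')
MAX_FILE_SZ = 1000000


def md5_file(fname):
    # Runs in a worker: hash the file in chunks and report the result on the
    # shared queue (attached to this function by init()).
    try:
        # Binary mode, so the bytes are hashed exactly as stored on disk.
        with open(fname, 'rb') as fo:
            m = hashlib.md5()
            chunk_sz = m.block_size * 128
            data = fo.read(chunk_sz)
            while data:
                m.update(data)
                data = fo.read(chunk_sz)
        md5_file.queue.put((fname, m.hexdigest()))
    except IOError:
        md5_file.queue.put((fname, None))


def is_valid_file(fname):
    ext = os.path.splitext(fname)[1].upper()
    try:
        fsz = os.path.getsize(fname)
    except OSError:
        # Size unavailable (for example a dangling symlink): skip the file.
        return False
    return ext in VALID_EXTENSIONS and fsz <= MAX_FILE_SZ


def init(queue):
    # Pool initializer: runs once in each worker process as it starts.
    md5_file.queue = queue


def main():
    # Holds (fname, md5sum) tuples; md5sum is None if an IOError occurred.
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(None, init, [queue])
    submitted = 0

    # os.walk is a generator: it yields one directory at a time instead of
    # building the whole tree up front.
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        # Convert filenames to full paths...
        full_path_fnames = [os.path.join(dirpath, fn) for fn in filenames]
        full_path_fnames = [fn for fn in full_path_fnames if is_valid_file(fn)]
        pool.map(md5_file, full_path_fnames)
        submitted += len(full_path_fnames)

    # Dump the queue. Every task puts exactly one item, so fetch that many;
    # queue.empty() is not reliable while items are still in transit.
    for _ in range(submitted):
        print(queue.get())

    pool.close()
    pool.join()
    return 0


if __name__ == '__main__':
    sys.exit(main())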
It may not be bulletproof, but it works for me. You'll probably want to tweak it to give some feedback about what it is doing.
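One way to get that feedback, as a rough sketch rather than a drop-in replacement: have each worker return its result instead of queueing it, and read the results with imap_unordered, which hands them back as they complete. The names md5_of, hash_with_progress, report_every and log are made up for this illustration, and pool is assumed to be a multiprocessing.Pool (anything with an imap_unordered method would do).

```python
import hashlib
import sys


def md5_of(path, chunk_size=64 * 1024):
    """Return (path, hex digest), or (path, None) if the file cannot be read."""
    digest = hashlib.md5()
    try:
        with open(path, 'rb') as fo:
            # Read fixed-size chunks until read() returns an empty bytes object.
            for chunk in iter(lambda: fo.read(chunk_size), b''):
                digest.update(chunk)
    except (IOError, OSError):
        return path, None
    return path, digest.hexdigest()


def hash_with_progress(pool, paths, report_every=100, log=sys.stderr):
    """Yield (path, md5) pairs as workers finish, logging a running count."""
    done = 0
    for path, md5sum in pool.imap_unordered(md5_of, paths):
        done += 1
        if done % report_every == 0:
            log.write('%d files hashed\n' % done)
        yield path, md5sum
```

Looping over hash_with_progress(multiprocessing.Pool(), paths) would then stream out the checksums while a count goes to stderr every report_every files. Results arrive in completion order, not in the order of paths.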
For some odd reason, I couldn't share the queue as a plain global in the script above, so I hand it to each worker through the pool's initializer function (init). I'm not sure why that is necessary.
Just pass the root directory to the script as its sole argument, and it will dump out the MD5 sums once every file has been processed.
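On the memory question: os.walk already works one directory at a time, and the walk plus the extension/size filter can be folded into a single generator so that no per-directory list is built either. This is only a sketch under the same filter rules as the script above; iter_candidate_files is a hypothetical name, not part of any library.

```python
import os

VALID_EXTENSIONS = ('.JPG', '.GIF', '.JPEG')  # same filter as the script above
MAX_FILE_SZ = 1000000


def iter_candidate_files(root):
    """Lazily yield full paths under root that pass the extension/size filter."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].upper() not in VALID_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) <= MAX_FILE_SZ:
                    yield path
            except OSError:
                # Unreadable entry (for example a dangling symlink): skip it.
                continue
```

Because it is an ordinary iterable, a generator like this can be passed straight to Pool.imap or Pool.imap_unordered in place of a list of paths.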