python, http, download, urllib2

Methods for downloading multiple large files using Python on a limited network


My team needs to pull 30+ files a day, each averaging roughly 5 to 10 gigabytes. Timing a single urllib2 request, one file takes roughly 1.5 to 2 hours, so downloading sequentially caps us at about 12 files per day. These 30+ files are generated daily and have to be pulled on top of all our other downloads and automated processes for our data analysis team, so being able to download several files at a time with minimal bandwidth loss would be ideal.

I've found the method below in some leftover code on our system, but I'm wondering whether it actually works better or just seems to. From testing, it works fine for 3 to 10 files, but additional instances beyond that slow things down. I also don't want to open more than 5 to 10 instances at a time, because past that I notice a drop in bandwidth; 5 seems to be the sweet spot. So how do I have script1.py wait and check that all the files from one batch have finished downloading before opening 5 more instances of script2.py, and repeat? (A rough sketch of the batching I have in mind follows the two scripts below.) Would urllib3 be better? I am not too familiar with the threading or multiprocessing libraries.

#script1.py
import subprocess, time
lines = 0 
homepath = "C:\\Auto_tasks\\downloader\\logs"
url_list_local = "c:\\Requests\\download_urls.txt"
targets_file = open(url_list_local, 'r')
for line in targets_file:
    url = line.rstrip('\n')
    # build the command line that launches one script2.py instance for this url
    surl = ("\"C:\\Python26\\python.exe\" "
            "\"C:\\Auto_tasks\\downloader\\scripts\\script2.py\" "
            + url + " \"" + homepath + "\"")
    subprocess.Popen(surl)
    lines += 1
    time.sleep(1)


#script2.py, individual instances opened simultaneously for n files
import urllib2, time, os, sys, shutil, subprocess
os.chdir("C:\\Auto_tasks\\downloader\\working") #sets directory where downloads will go
homepath = sys.argv[2]
url = sys.argv[1]
file_name = url.split('/')[-1]
# command used to relaunch this script if the download fails to start
surl = ("\"C:\\Python26\\python.exe\" "
        "\"C:\\Auto_tasks\\downloader\\scripts\\script2.py\" "
        + url + " \"" + homepath + "\"")
try:
    u = urllib2.urlopen(url)
except IOError:
    print "FAILED to start download, retrying..."
    time.sleep(30)
    subprocess.Popen(surl)  # hand the url off to a fresh instance
    sys.exit(1)             # and let this one exit instead of moving a file it never wrote
# stream the response to disk in chunks so the 5-10 GB files are never held in memory
with open(file_name, 'wb') as local_file:
    shutil.copyfileobj(u, local_file)
src_file = "C:\\Auto_tasks\\downloader\\working\\" + file_name
dst_file = "C:\\Auto_tasks\\downloader\\completed"
shutil.move(src_file, dst_file)
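
Roughly, the batching I have in mind for script1.py would be: launch 5 instances of script2.py, wait for every process in that batch to exit, then launch the next 5. This is just an untested sketch of that idea, reusing the same paths as above; the script name and BATCH_SIZE are only for illustration:

# script1_batched.py -- untested sketch: launch script2.py in batches of 5
import subprocess

PYTHON = "C:\\Python26\\python.exe"
SCRIPT2 = "C:\\Auto_tasks\\downloader\\scripts\\script2.py"
homepath = "C:\\Auto_tasks\\downloader\\logs"
url_list_local = "c:\\Requests\\download_urls.txt"
BATCH_SIZE = 5

urls = [line.strip() for line in open(url_list_local) if line.strip()]

for start in range(0, len(urls), BATCH_SIZE):
    batch = urls[start:start + BATCH_SIZE]
    # start one script2.py per url in this batch
    procs = [subprocess.Popen([PYTHON, SCRIPT2, url, homepath]) for url in batch]
    # block until every process in the batch has exited before starting the next batch
    for p in procs:
        p.wait()

Passing the command to Popen as a list would also sidestep the manual quote-escaping in the scripts above, but I don't know if this is the right overall approach.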

Solution

    1. Downloading multiple files is a pretty common task. In the Linux world there is wget, which handles bandwidth limiting, retries, and a lot of other features for you -- and Windows builds of it are available (a sketch of driving it from Python follows the code below).

    2. For doing it with a pool of Python processes, here is one way:

        # downpool.py
        # usage: python downpool.py download_urls.txt
        import logging
        import multiprocessing
        import os, shutil, sys, urllib

        def downloader(url):
            mylog = multiprocessing.get_logger()
            mylog.info('start')
            mylog.info('%s: downloading', url)

            # download to a temporary file managed by urlretrieve
            (temp_path, _headers) = urllib.urlretrieve(url)

            # move into the destination directory (which must already exist),
            # keeping the file name from the url
            dest_path = os.path.join('temp', os.path.basename(url))
            shutil.move(temp_path, dest_path)

            mylog.info('%s: done', url)
            return dest_path


        if __name__ == '__main__':
            # the __main__ guard is required for multiprocessing on Windows
            plog = multiprocessing.log_to_stderr()
            plog.setLevel(logging.INFO)

            download_urls = [ line.strip() for line in open( sys.argv[1] ) ]

            plog.info('starting parallel downloads of %d urls', len(download_urls))
            pool = multiprocessing.Pool(5)   # five worker processes
            plog.info('running jobs')
            download_paths = list( pool.imap( downloader, download_urls ) )
            plog.info('done')

            print 'Downloaded:\n', '\n'.join( download_paths )
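
    As a rough sketch of point 1, assuming a Windows build of wget is installed and on the PATH, the whole URL list could be handed to it from Python. The paths are the ones from the question and the rate cap is just an example value:

        # wget_pull.py -- sketch of letting wget do the downloading
        import subprocess

        subprocess.call(["wget",
                         "-c",                 # resume partially downloaded files
                         "--limit-rate=2m",    # cap bandwidth per download
                         "-i", "c:\\Requests\\download_urls.txt",       # read urls from a file
                         "-P", "C:\\Auto_tasks\\downloader\\completed"]) # directory to save into

    Note that wget works through the list sequentially; for parallel pulls you could still launch several wget processes, one per slice of the URL list, using the same batching idea from the question.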