My team needs to pull 30+ files a day, averaging roughly 5 to 10 gigabytes apiece. Timing a single urllib2 request, each file takes roughly 1.5 to 2 hours to download sequentially, which caps us at about 12 files per day. These 30+ files are generated daily and have to be pulled on top of all our other downloads and automated processes for our data analysis team, so being able to download several files at a time with minimal bandwidth loss would be ideal.
I found the approach below in some leftover code on our system, but I'm not sure whether it actually performs better or just appears to. In testing it works fine for 3 to 10 files, but additional instances beyond that slow things down. I also only want to open maybe 5 to 10 instances at a time, because past that I notice a drop in bandwidth; 5 seems to be the sweet spot. So how do I make script1.py wait and check that all the files in the current batch have finished downloading before opening the next 5 instances of script2.py (a rough sketch of what I have in mind follows the two scripts below)? Would urllib3 be better? I'm not very familiar with the threading or multiprocessing libraries.
#script1.py
import subprocess, time

lines = 0
homepath = "C:\\Auto_tasks\\downloader\\logs"
url_list_local = "c:\\Requests\\download_urls.txt"

targets_file = open(url_list_local, 'r')
for line in targets_file:
    url = line.rstrip('\n')
    # build the command line that launches one script2.py instance for this URL
    surl = ("\"C:\\Python26\\python.exe\" "
            "\"C:\\Auto_tasks\\downloader\\scripts\\script2.py\" "
            + url + " \"" + homepath + "\"")
    subprocess.Popen(surl)
    lines += 1
    time.sleep(1)
#script2.py, individual instances opened simultaneously for n files
import urllib2, time, os, sys, shutil, subprocess

os.chdir("C:\\Auto_tasks\\downloader\\working")  # directory where downloads will go
homepath = sys.argv[2]
url = sys.argv[1]
file_name = url.split('/')[-1]

# command line used to relaunch this script if the download fails to start
surl = ("\"C:\\Python26\\python.exe\" "
        "\"C:\\Auto_tasks\\downloader\\scripts\\script2.py\" "
        + url + " \"" + homepath + "\"")

try:
    u = urllib2.urlopen(url)
except IOError:
    print "FAILED to start download, retrying..."
    time.sleep(30)
    subprocess.Popen(surl)
    sys.exit(1)

# write the response to disk in the working directory
with open(file_name, 'wb') as f:
    shutil.copyfileobj(u, f)

# move the finished file into the completed folder
src_file = "C:\\Auto_tasks\\downloader\\working\\" + file_name
dst_file = "C:\\Auto_tasks\\downloader\\completed"
shutil.move(src_file, dst_file)
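For reference, here is a rough, untested sketch of the batch-and-wait behavior I have in mind for script1.py; the batch size of 5 and the use of Popen.wait() are just my guesses at how this could be done:

# sketch only: launch script2.py in batches of 5 and wait for each batch
import subprocess, time

homepath = "C:\\Auto_tasks\\downloader\\logs"
url_list_local = "c:\\Requests\\download_urls.txt"
batch_size = 5  # guessed sweet spot

urls = [line.rstrip('\n') for line in open(url_list_local)]

for i in range(0, len(urls), batch_size):
    procs = []
    for url in urls[i:i + batch_size]:
        cmd = ("\"C:\\Python26\\python.exe\" "
               "\"C:\\Auto_tasks\\downloader\\scripts\\script2.py\" "
               + url + " \"" + homepath + "\"")
        procs.append(subprocess.Popen(cmd))
        time.sleep(1)
    # block until every script2.py instance in this batch has exited
    for p in procs:
        p.wait()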
Downloading multiple files is a pretty common task. In the Linux world the usual tool is wget, which handles bandwidth limiting and a lot of other features for you, and there are Windows builds of it as well.
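For example, a minimal sketch of driving wget from Python, assuming a Windows wget build is on your PATH and reusing the URL list and folders from your question (the rate cap is just an illustrative value):

import subprocess

# wget reads the URL list itself (-i), saves into -P, and can cap bandwidth
subprocess.call([
    "wget",
    "-i", "c:\\Requests\\download_urls.txt",        # file with one URL per line
    "-P", "C:\\Auto_tasks\\downloader\\completed",  # directory to save downloads into
    "-c",                                           # resume partially downloaded files
    "--limit-rate=5m",                              # cap bandwidth (example value)
])

Note that a single wget process works through the list sequentially; for parallelism you could split the list and launch a few such processes.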
For doing it with a pool of Python processes, here's one way to do it:
# downpool.py
import logging
import multiprocessing
import os, shutil, sys, urllib

def downloader(url):
    mylog = multiprocessing.get_logger()
    mylog.info('start')
    mylog.info('%s: downloading', url)
    # download to a temporary file managed by urllib
    (temp_path, _headers) = urllib.urlretrieve(url)
    # move to the destination directory, preserving the file name from the URL
    dest_path = os.path.join('temp', os.path.basename(url))
    shutil.move(temp_path, dest_path)
    mylog.info('%s: done', url)
    return dest_path

if __name__ == '__main__':
    plog = multiprocessing.log_to_stderr()
    plog.setLevel(logging.INFO)

    download_urls = [line.strip() for line in open(sys.argv[1])]
    plog.info('starting parallel downloads of %d urls', len(download_urls))

    # five worker processes, matching the five-at-a-time sweet spot
    pool = multiprocessing.Pool(5)
    plog.info('running jobs')
    download_paths = list(pool.imap(downloader, download_urls))

    plog.info('done')
    print 'Downloaded:\n', '\n'.join(download_paths)
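To run it against the same URL list as in your question it would be something like C:\Python26\python.exe downpool.py c:\Requests\download_urls.txt. Note that downloader() moves each finished file into a local temp directory, so create that (or change dest_path) first; Pool(5) matches your five-at-a-time sweet spot, and you can raise or lower that number to trade concurrency against bandwidth.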