Tags: python, python-3.x, download, urllib, common-crawl

How to download multiple large files concurrently in Python?


I am trying to download a series of WARC files from the Common Crawl dataset, each about 25 MB. This is my script:

import json
import urllib.request
from urllib.error import HTTPError

from src.Util import rooted

with open(rooted('data/alexa.txt'), 'r') as alexa:
    for i, url in enumerate(alexa):
        if i % 1000 == 0:
            try:
                request = 'http://index.commoncrawl.org/CC-MAIN-2018-13-index?url={search}*&output=json' \
                    .format(search=url.rstrip())
                page = urllib.request.urlopen(request)
                for line in page:
                    result = json.loads(line)
                    urllib.request.urlretrieve('https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
                                               rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
            except HTTPError:
                pass

The script currently requests the download link for each WARC file from the Common Crawl index REST API and then downloads it into the 'data/warc' folder.

The problem is that each urllib.request.urlretrieve() call blocks until the file has finished downloading before the next download request is issued. Is there a way to return from the call as soon as the download has been started, with the file fetched in the background, or some way to spin up a new thread for each request so that all the files download simultaneously?

Thanks


Solution

  • Use threads, futures even :)

    from concurrent.futures import ThreadPoolExecutor

    jobs = []
    with ThreadPoolExecutor(max_workers=100) as executor:
        for line in page:
            result = json.loads(line)
            # submit the download to the pool and move on instead of blocking on it
            future = executor.submit(urllib.request.urlretrieve,
                                     'https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
                                     rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
            jobs.append(future)
    ...
    # wait for the downloads to finish; result() re-raises any exception from the worker
    for f in jobs:
        print(f.result())
    

    Read more here: https://docs.python.org/3/library/concurrent.futures.html (a fuller end-to-end sketch follows below).
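
For completeness, here is a minimal end-to-end sketch of how the index query from the question can be combined with the thread pool, collecting results with as_completed() so finished downloads are reported as soon as they complete. It assumes the rooted() path helper from src.Util in the question, uses 'example.com' only as a placeholder search term, and max_workers=10 is just a guess to be tuned to your bandwidth and what the server tolerates.

    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed
    from urllib.error import HTTPError

    from src.Util import rooted  # path helper from the question


    def fetch(result):
        # download one WARC file listed in an index response line
        target = rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum()))
        urllib.request.urlretrieve(
            'https://commoncrawl.s3.amazonaws.com/%s' % result['filename'], target)
        return target


    # 'example.com' is a placeholder; substitute the URLs read from alexa.txt
    request = 'http://index.commoncrawl.org/CC-MAIN-2018-13-index?url={search}*&output=json' \
        .format(search='example.com')

    with urllib.request.urlopen(request) as page, \
            ThreadPoolExecutor(max_workers=10) as executor:
        jobs = [executor.submit(fetch, json.loads(line)) for line in page]
        for job in as_completed(jobs):
            try:
                print('saved', job.result())
            except HTTPError as err:
                print('failed:', err)

Submitting a small fetch() function rather than urlretrieve directly keeps the path-building logic in one place and lets each future return the path it saved to.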