Tags: python, concurrency, yield, coroutine, yield-from

Concurrent download and processing of large files in python


I have a list of URLs for large files to download (e.g. compressed archives), which I want to process (e.g. decompress the archives).

Both downloading and processing take a long time, and processing is heavy on disk IO, so I want just one of each running at a time. Since the two tasks take about the same time and do not compete for the same resources, I want to download the next file(s) while the last one is being processed.

This is a variation of the producer-consumer problem.

The situation is similar to reading and processing images or downloading loads of files, but my downloader calls are not (yet) picklable, so I have not been able to use multiprocessing; and, as mentioned, both tasks take about the same time.

Here is a dummy example, where both download and processing are blocking:

import time
import posixpath

def download(urls):
    for url in urls:
        time.sleep(3)  # this is the download (more like 1000s) 
        yield posixpath.basename(url)

def process(fname):
    time.sleep(2)  # this is the processing part (more like 600s)

urls = ['a', 'b', 'c']
for fname in download(urls):
    process(fname)
    print(fname)

How could I make the two tasks concurrent? Can I use yield or yield from in a smart way, perhaps in combination with deque? Or must it be asyncio with Future?


Solution

  • I'd simply use threading.Thread(target=process, args=(fname,)) and start a new thread for processing.

    But before starting a new one, wait for the previous processing thread to finish:

    import threading

    t = None
    for fname in download(urls):
        if t is not None:  # wait for the previous processing thread to finish
            t.join()
        t = threading.Thread(target=process, args=(fname,))
        t.start()
        print('[i] thread started for %s' % fname)
    if t is not None:      # wait for the last file to finish processing
        t.join()
    

    See https://docs.python.org/3/library/threading.html
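
    A variation on the same idea, in case you want the downloader to be able to run ahead by more than one file: run the downloads in their own thread and hand finished files to the main thread through a bounded queue.Queue. This is only a sketch, reusing the download(), process() and urls names from the dummy example above; the maxsize value and the None sentinel are arbitrary choices.

    import queue
    import threading

    q = queue.Queue(maxsize=1)  # bounded buffer: downloader can't run arbitrarily far ahead

    def downloader(urls, q):
        for fname in download(urls):  # blocking downloads, one at a time
            q.put(fname)              # blocks while the buffer is full
        q.put(None)                   # sentinel: no more files

    t = threading.Thread(target=downloader, args=(urls, q))
    t.start()

    while True:
        fname = q.get()
        if fname is None:  # downloader is done
            break
        process(fname)     # heavy disk IO, one file at a time
        print(fname)

    t.join()

    As for the last part of the question: asyncio with Future is not required here. Both download and process are blocking calls, so plain threads (or concurrent.futures.ThreadPoolExecutor) are the more direct fit; asyncio mainly pays off if the downloader exposes a non-blocking API, or if you wrap the blocking calls with asyncio.to_thread / run_in_executor.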