I have a list of URLs for large files to download (e.g. compressed archives), which I want to process (e.g. decompress the archives).
Both download and processing take a long time and processing is heavy on disk IO, so I want to have just one of each to run at a time. Since the two tasks take about the same time and do not compete for the same resources, I want to download the next file(s) while the last is being processed.
This is a variation of the producer-consumer problem.
The situation is similar to reading and processing images or downloading loads of files, but my downloader calls are not (yet) picklable, so I have not been able to use multiprocessing, and both tasks take about the same time.
Here is a dummy example, where both download and processing are blocking:
import time
import posixpath

def download(urls):
    for url in urls:
        time.sleep(3)  # this is the download (more like 1000s)
        yield posixpath.basename(url)

def process(fname):
    time.sleep(2)  # this is the processing part (more like 600s)

urls = ['a', 'b', 'c']
for fname in download(urls):
    process(fname)
    print(fname)
How could I make the two tasks concurrent? Can I use yield or yield from in a smart way, perhaps in combination with deque? Or must it be asyncio with Future?
I'd simply use threading.Thread(target=process, args=(fname,)) and start a new thread for each processing job. But before starting a new one, join the previous processing thread so that only one runs at a time:
import threading

t = None
for fname in download(urls):
    if t is not None:  # wait for the previous processing thread to end
        t.join()
    t = threading.Thread(target=process, args=(fname,))
    t.start()
    print('[i] thread started for %s' % fname)
if t is not None:  # don't forget to wait for the final thread
    t.join()
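The same one-processing-job-at-a-time overlap can also be expressed with concurrent.futures, which avoids the manual join bookkeeping: a ThreadPoolExecutor with max_workers=1 guarantees at most one processing job runs at once, while submit() returns immediately so the next download proceeds in parallel. Below is a sketch with shortened sleeps standing in for the real work; the download()/process() bodies are the dummy versions from the question, not real network or disk code.

```python
import time
import posixpath
from concurrent.futures import ThreadPoolExecutor

def download(urls):
    for url in urls:
        time.sleep(0.3)  # stands in for the real download
        yield posixpath.basename(url)

def process(fname):
    time.sleep(0.2)  # stands in for the real processing
    return fname

urls = ['dir/a', 'dir/b', 'dir/c']

# max_workers=1 ensures at most one processing job at a time;
# submit() is non-blocking, so the next download overlaps processing.
with ThreadPoolExecutor(max_workers=1) as ex:
    futures = [ex.submit(process, fname) for fname in download(urls)]
    results = [f.result() for f in futures]

print(results)  # ['a', 'b', 'c']
```

Note that this version may download more than one file ahead if processing falls behind, since submitted jobs simply queue up in the executor; if disk space is a concern, the explicit-join version above keeps at most one downloaded-but-unprocessed file around.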