I'd like to have an array of functions in Python, like this:

def download(x, y):
    r = requests.get(x, allow_redirects=True)
    open(y, 'wb').write(r.content)

array = [download(url, filename), download(url2, filename2)]
Now I want to use multiprocessing or multithreading (depending on what's better) to run all of them at the same time.
But the number of functions may increase, so I'm struggling to set up multithreading.
Any suggestions?
With multiprocessing, something like this. Notes:

- You could use multiprocessing.pool.ThreadPool() instead to run threads within a single process (a sketch of that swap follows the example below).
- Pool() spawns a worker for each CPU you have. You may wish to adjust that up or down.
- A requests Session is more efficient than using requests.get() directly (it avoids repeating the TCP, HTTP, TLS overhead for each request).
- You're not calling resp.raise_for_status(); that means you could end up saving a 404 or 500 response as if everything went alright. (You may wish to add some exception handling now; any job failing will propagate up to close the pool and kill the other jobs. One pattern for that is sketched below.)
- stream=True and resp.iter_content() will avoid buffering the content in memory.
- imap_unordered() is the fastest of the high-level pool operations, but as the name implies, it loses the order of the jobs. That shouldn't matter in this case.
- If there are very many jobs, consider the chunksize parameter, which basically means that a single worker will be sent multiple work items at a time (usage shown below).
import multiprocessing
import requests

sess = requests.Session()

def download(job):
    # Each job is an (url, filename) pair.
    url, filename = job
    resp = sess.get(url, allow_redirects=True, stream=True)
    resp.raise_for_status()  # don't silently save 404/500 error pages
    with open(filename, "wb") as f:
        # Stream in 512 KiB chunks instead of buffering the whole body in memory.
        for chunk in resp.iter_content(524288):
            f.write(chunk)

def main():
    jobs = [
        (url, filename),
        (url2, filename2),
    ]
    with multiprocessing.Pool() as p:
        # We only iterate to drive the pool; download() returns nothing.
        for _ in p.imap_unordered(download, jobs):
            pass

if __name__ == "__main__":
    main()
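
On the exception-handling point above: one common pattern is to catch errors inside the worker and return them as results, so one failed download doesn't tear down the whole pool. A minimal sketch of that; the try/except wrapper, the (filename, error) result convention and the placeholder URLs/filenames are assumptions for illustration:

import multiprocessing
import requests

sess = requests.Session()

def download(job):
    url, filename = job
    try:
        resp = sess.get(url, allow_redirects=True, stream=True)
        resp.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in resp.iter_content(524288):
                f.write(chunk)
    except requests.RequestException as exc:
        return filename, exc  # report the failure instead of raising
    return filename, None  # None signals success

def main():
    jobs = [
        ("https://example.com/a", "a.html"),  # placeholder jobs
        ("https://example.com/b", "b.html"),
    ]
    with multiprocessing.Pool() as p:
        for filename, error in p.imap_unordered(download, jobs):
            if error is not None:
                print(f"{filename} failed: {error}")

if __name__ == "__main__":
    main()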
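
The ThreadPool variant mentioned above is a near drop-in swap, and downloads are I/O-bound, so threads work well here. A sketch under the same assumptions (placeholder jobs; the thread count of 8 is an arbitrary choice):

from multiprocessing.pool import ThreadPool
import requests

# All threads share this one Session. requests does not formally document
# Session as thread-safe, though sharing one across download threads is a
# widespread pattern.
sess = requests.Session()

def download(job):
    url, filename = job
    resp = sess.get(url, allow_redirects=True, stream=True)
    resp.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in resp.iter_content(524288):
            f.write(chunk)

def main():
    jobs = [
        ("https://example.com/a", "a.html"),  # placeholder jobs
        ("https://example.com/b", "b.html"),
    ]
    # ThreadPool exposes the same interface as Pool but runs threads in a
    # single process; 8 threads is an arbitrary choice for I/O-bound work.
    with ThreadPool(8) as p:
        for _ in p.imap_unordered(download, jobs):
            pass

if __name__ == "__main__":
    main()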
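
Finally, if the job list grows to many thousands of small jobs, the chunksize parameter mentioned above is a one-argument change: p.imap_unordered(download, jobs, chunksize=16) hands each worker sixteen jobs at a time and cuts per-job dispatch overhead (16 here is just an example value).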