python, multithreading, multiprocessing, python-multiprocessing, python-multithreading

Python Multiprocessing all Functions within an Array


I'd like to have an array of functions in Python, like this:

import requests

def download(x, y):
    r = requests.get(x, allow_redirects=True)
    open(y, 'wb').write(r.content)

array = [download(url, filename), download(url2, filename2)]

Now I want to use multiprocessing or multithreading (depending on what's better) to run all of them at the same time.

But the number of functions may increase, so I'm struggling to set up multithreading for them.

Any suggestions?


Solution

  • With multiprocessing, something like this.

    • This is using a process pool; for a network-IO-bound operation, you could use multiprocessing.pool.ThreadPool() instead, which runs threads within a single process (see the sketch after the code below).
    • By default, Pool() spawns a worker for each CPU you have. You may wish to adjust that up or down.
    • Using a requests Session is more efficient than calling requests.get() directly, since it reuses connections and avoids repeated TCP, HTTP, and TLS setup overhead.
    • You were missing resp.raise_for_status(); without it, you could end up saving a 404 or 500 error page as if everything had gone alright. (You may wish to add some exception handling too; as it stands, any job that fails will propagate up, close the pool, and kill the other jobs. See the error-handling sketch below the code.)
    • Using stream=True and resp.iter_content() avoids buffering the whole response in memory.
    • imap_unordered() is the fastest of the high-level pool operations, but as the name implies, it does not preserve the order of the jobs. That shouldn't matter in this case.
    • If you have more jobs, you can reduce some of the pool overhead with the chunksize parameter, which sends each worker several work items at a time (also shown in the sketch below).
    import multiprocessing
    import requests
    
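    # module-level Session: connections are reused; each pool worker ends up with its own copy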
    sess = requests.Session()
    
    
    def download(job):
        url, filename = job
        resp = sess.get(url, allow_redirects=True, stream=True)
        resp.raise_for_status()
        with open(filename, "wb") as f:
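            # 524288 bytes = 512 KiB per chunk; streamed straight to disk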
            for chunk in resp.iter_content(524288):
                f.write(chunk)
    
    
    def main():
    
        jobs = [
            (url, filename),
            (url2, filename2),
        ]
    
        with multiprocessing.Pool() as p:
            for _ in p.imap_unordered(download, jobs):
                pass
    
    
    if __name__ == "__main__":
        main()
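
As mentioned above, for network-IO-bound downloads you can swap in a thread pool. A minimal sketch, assuming the same download() worker as above; ThreadPool shares Pool's interface, so only main() changes (the pool size of 8 here is an arbitrary choice):

    from multiprocessing.pool import ThreadPool


    def main():

        jobs = [
            (url, filename),
            (url2, filename2),
        ]

        # threads within a single process; much cheaper to start than worker processes
        with ThreadPool(8) as p:
            for _ in p.imap_unordered(download, jobs):
                pass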
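
And for the exception-handling and chunksize points above, here is one possible sketch (the download_safe() wrapper name and chunksize=4 are illustrative choices, not fixed requirements). Catching errors inside the worker keeps one bad URL from killing the whole pool, and returning the outcome lets the parent report failures:

    def download_safe(job):
        url, filename = job
        try:
            download(job)
            return url, None
        except Exception as exc:  # e.g. HTTPError raised by raise_for_status()
            return url, exc


    def main():

        jobs = [
            (url, filename),
            (url2, filename2),
        ]

        with multiprocessing.Pool() as p:
            # chunksize batches several jobs per dispatch to each worker
            for url, err in p.imap_unordered(download_safe, jobs, chunksize=4):
                if err is not None:
                    print(f"download of {url} failed: {err}")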