
Multiprocessing with Pool by lowering memory usage in worker


My goal is to multi-process requests to an external API with hundreds of concurrent requests. I do this with the Pool function, which works fine, but with 64 workers I get a RAM usage of 25 GB (proportional to the number of workers), which seems way too high for a simple HTTP request.

How can I reduce RAM usage to its minimum so that I can launch hundreds of workers?

My hypothesis is that the Pool function duplicates the whole process memory in every worker. How can I avoid that?

The code:

import requests
from multiprocessing import Pool

def get_results(self, query):
    self.data["query"] = query["query"]
    results = requests.post(url_to_external_api_with_query_data).json()
    return {"results": results, "original_query": query["original_query"], "original_query_string": query["query"]}

def multiprocess_results(self, queries):
    pool = Pool(64)
    results_all = pool.map(self.get_results, queries)
    pool.close()
    pool.join()
    return results_all

Solution

  • Well, first off: since you are sending HTTPS requests, this is an I/O-bound workload, so you may want to consider using multi-threading instead of multiprocessing, which should fix your memory problem right up. The thing with multiprocessing is that it creates duplicate processes, each owning its own copy of the Python interpreter, so several workers run in parallel and the overall memory usage grows with the number of processes you spawn.

    For multiprocessing and multi-threading I often recommend concurrent.futures. One reason I recommend it is that it automatically picks a sensible number of workers for a multiprocessing task, based on the number of CPUs available (this can be overridden with the max_workers argument whenever you please).

    Using this module can also be easier than the typical multiprocessing module, in that you get more done with less code.

    from concurrent.futures import ProcessPoolExecutor
    
    ....
    
    with ProcessPoolExecutor() as executor:
        results_all = list(executor.map(self.get_results, queries))
    

    Again, since this is sending HTTPS requests, it is an I/O-bound operation and you should consider using multi-threading. Both pool types in this module work alike:

    from concurrent.futures import ThreadPoolExecutor
    
    ....
    
    with ThreadPoolExecutor() as executor:
        results_all = list(executor.map(self.get_results, queries))