python, python-multiprocessing

Constantly running Pool of workers


I'm using multiprocessing.Pool to parallelize the processing of some files. The code waits for a file to arrive and then submits it to a worker with Pool.apply_async, which processes it.

This code is supposed to run forever, so I never close the pool. Over time, however, this causes the pool to consume a lot of memory.

The code is something like this:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(processes=PROCESS_COUNT) as pool:
        while True:
            f = wait_for_file()                    # block until the next file arrives
            pool.apply_async(process_file, (f,))   # hand it off to a worker

How can I prevent this high memory usage without closing the pool?


Solution

  • Yes, if you allocate resources and never deallocate them, be they spawned processes or simply (chunks of) memory, you'll have fewer resources left for other tasks on your machine until you or your system deallocates them, willingly or forcefully.

    You may want to use the maxtasksperchild argument to Pool so that worker processes are periodically killed and replaced; if they allocate memory and there is a leak somewhere, this at least reclaims some resources (a minimal sketch is included at the end of this answer).

    Note: Worker processes within a Pool typically live for the complete duration of the Pool's work queue. A frequent pattern found in other systems (such as Apache, mod_wsgi, etc.) to free resources held by workers is to allow a worker within a pool to complete only a set amount of work before exiting, being cleaned up and a new process spawned to replace the old one. The maxtasksperchild argument to the Pool exposes this ability to the end user.

    Alternatively, don't roll your own Pool implementation: until you get it right it will be buggy, and you'll burn time unnecessarily. Instead, use e.g. Celery (tutorial), which hopefully even has tests for the nasty corner cases you might otherwise spend more time on than necessary (a rough Celery sketch also follows at the end of this answer).

    Or, if you want to experiment a bit, here is a similar question which provides the steps for custom worker-pool management.
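
Below is a minimal sketch of the maxtasksperchild approach, assuming the same shape of code as in the question; PROCESS_COUNT, wait_for_file and process_file are stand-ins here, and 100 is just an arbitrary value to tune:

import time
from multiprocessing import Pool

PROCESS_COUNT = 4          # example value; pick what suits your machine

def wait_for_file():
    # stand-in for the question's blocking wait; replace with real logic
    time.sleep(1)
    return "example.dat"

def process_file(path):
    # stand-in for the question's per-file work; replace with real logic
    print("processing", path)

if __name__ == "__main__":
    # maxtasksperchild=100 makes each worker exit after completing 100 tasks
    # and be replaced by a fresh process, so any memory it leaked is released.
    with Pool(processes=PROCESS_COUNT, maxtasksperchild=100) as pool:
        while True:
            f = wait_for_file()
            pool.apply_async(process_file, (f,))

Recycling a worker this way releases everything the child process held, no matter where in process_file the leak actually lives.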
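
And, purely as a rough sketch of the Celery route (the module name tasks.py, the Redis broker URL and the worker_max_tasks_per_child setting below are illustrative assumptions, not part of the original answer):

# tasks.py -- a minimal Celery task module
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

# Celery's counterpart to maxtasksperchild: recycle each worker after 100 tasks.
app.conf.worker_max_tasks_per_child = 100

@app.task
def process_file(path):
    # replace with the real per-file processing
    print("processing", path)

The receiving loop then just calls process_file.delay(f) for each incoming file, and the workers run separately, started with something like celery -A tasks worker.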