We are experiencing an issue with Python Celery (which uses multiprocessing) where large periodic (scheduled) tasks consume massive amounts of memory for short bursts of time, but because the worker process lives for the life of the pool (MAX_TASKS_PER_CHILD=None), the memory is not garbage collected (i.e. it is being "high-water" reserved).
(This problem is further worsened by Heroku, which sees a large, constant amount of memory allocated and pushes it into swap, which degrades performance.)
We have found that by setting MAX_TASKS_PER_CHILD=1, we fork a new process (Celery worker instance) after every task and the memory is properly garbage collected. Sweet!
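For reference, here's roughly how we're applying it (a sketch; the lowercase setting name assumes Celery 4+, older releases spell it CELERYD_MAX_TASKS_PER_CHILD instead):

# celeryconfig.py (sketch)
worker_max_tasks_per_child = 1  # Recycle each pool process after every task

# Or, equivalently, on the command line ("proj" is a placeholder app name):
#   celery -A proj worker --max-tasks-per-child=1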
However, while there are plenty of articles that suggest this same solution, I have not found any that identify its downsides. What are the potential downsides of forking a new process after every task?
My guesses would be:
1. CPU overhead (but probably a tiny amount)
2. Potential errors when forking (but I can't find any documentation on this)
Aside from the obvious increase in CPU overhead from repeated forking (not a big deal if the workers do enough work per task), one possible downside arises if the parent process continues to grow in size. If it does, it increases the size of all the child processes (which fork from a larger and larger parent). On its own this wouldn't matter much (presumably little of that memory will be written, so little copying is required and actual memory use won't be a major issue), but IIRC the Linux overcommit heuristics assume that the COW memory will eventually be copied, and you could invoke the OOM killer even if you're nowhere near actually exceeding the heuristic limit in terms of private pages.
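If you want to check which overcommit policy your boxes are actually using, a quick Linux-only sketch:

# 0 = heuristic overcommit (the mode described above),
# 1 = always overcommit, 2 = strict accounting
with open("/proc/sys/vm/overcommit_memory") as f:
    print("vm.overcommit_memory =", f.read().strip())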
On Python 3.4 and higher, you can avoid this issue by explicitly setting your multiprocessing start method to forkserver on program launch (before doing any work the workers don't rely on), which will fork workers from a separate server process that should not dramatically increase in size.
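A minimal sketch of what that looks like (Unix-only, since forkserver isn't available on Windows; note that worker functions must live at module level, because forkserver children don't inherit everything from __main__):

import multiprocessing

def do_work(x):
    # Placeholder task; defined at module level so forkserver children
    # can import it rather than inheriting it from __main__
    return x * x

if __name__ == "__main__":
    # Must be called once, before any Process or Pool objects are created
    multiprocessing.set_start_method("forkserver")

    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(do_work, range(10)))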
Note: Above, I said "presumably little of the memory will be written, and therefore little copying is required and actual memory use won't be a major issue", but this is something of a lie on CPython. As soon as the cyclic garbage collector runs, the reference counts of all objects that can potentially participate in a reference cycle (e.g. all heterogeneous container types, but not simple "leaf" types like int and float) are touched. Touching them causes the pages containing them to be copied, so you're actually consuming memory in both parent and child.
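If you want to see which objects fall into that bucket, gc.is_tracked reports whether the cyclic collector tracks (and will therefore touch) a given object; a quick illustration:

import gc

# "Leaf" objects that can never participate in a cycle are not tracked,
# so a collection pass never dirties the pages they live on
print(gc.is_tracked(42))          # False
print(gc.is_tracked(3.14))        # False
print(gc.is_tracked("hello"))     # False

# Containers holding GC-relevant objects are tracked; a collection pass
# touches them, which unshares the COW pages they sit on
print(gc.is_tracked([1, 2, 3]))   # True
print(gc.is_tracked({"a": []}))   # True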
In 3.4, there was no good solution for long-running child processes; the only options were the two above: forkserver, to keep the template process from growing, and MAX_TASKS_PER_CHILD=1, so that even when processes do perform COW copies, they exit quickly and get replaced with fresh ones tied to the parent's memory image that don't accumulate private memory of their own.

That said, as of 3.7, there is a third option for when you're launching the processes yourself (or are responsible for creating the pool):
import gc at the top of your file, and after initializing as much as you can, but before creating your first Process or Pool object, run:
gc.freeze() # Moves all existing tracked objects to permanent generation,
# so they're never looked at again, in parent or child
The gc.freeze docs further recommend disabling GC in the parent as early as possible, calling freeze just before the fork, and re-enabling gc in the children. This avoids COW triggered by other pre-fork garbage collection leaving memory gaps that new allocations then fill and unshare (you leak some memory in the parent in exchange for minimizing unsharing in the children), so an even more complete solution might look like:
import gc
import multiprocessing

# Done as early as possible in the parent process to minimize freed gaps
# in shared pages that might get reused and trigger COW
gc.disable()  # Disables automatic garbage collection

# ... do as much initialization as possible here ...

# Done immediately before forking
gc.freeze()  # Moves all existing tracked objects to the permanent generation
             # so GC never touches them

with multiprocessing.Pool(initializer=gc.enable) as pool:  # Reenables gc in each
                                                           # worker process on launch
    ...  # Do stuff with pool

# Outside the with block, done with the pool
gc.enable()  # Optionally, if you never launch new workers,
             # reenable GC in the parent process
You can read more about the rationale for this feature on CPython bug #31558, which describes the problem, added gc.freeze (and the related functions), and explains the intended use cases.
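Since the question is about Celery specifically, where you don't create the pool yourself: the same idea can be expressed with Celery's signals. The sketch below uses the real worker_process_init signal, but whether module-import time in the parent is the right place to disable/freeze for your app is an assumption you'd need to verify:

import gc
from celery import Celery
from celery.signals import worker_process_init

app = Celery("proj")  # "proj" is a placeholder app name

gc.disable()  # Runs at import time in the parent worker, before the pool forks
# ... task module imports, configuration, cache warming, etc. ...
gc.freeze()   # Freeze everything allocated so far so children never unshare it

@worker_process_init.connect
def reenable_gc(**kwargs):
    # Dispatched in each prefork child process as it starts up
    gc.enable()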