
python, multiprocessing and dmtcp: checkpointing one process in Pool?


Is it possible to use python's integration of dmtcp to checkpoint a child process in parallel execution?

My situation is as follows: I have a multiprocessing.Pool with several workers receiving async jobs (using apply_async). Certain big jobs require all the resources (CPU cores and memory). When one of these jobs is accepted, I'd like to checkpoint all pending processes, remove them from execution, launch the big job, and finally resume the checkpointed processes.


Solution

  • If you start your Python program using dmtcp_launch python ... or dmtcp_launch ./myapp.py, all child processes created by the main process are automatically under checkpoint control. So when you request a checkpoint from within your main process, all other processes are checkpointed as well.

    I am not familiar enough with multiprocessing.Pool to comment in detail on that front, but from a quick look, it seems you don't want to checkpoint your main process (the scheduler). However, DMTCP will checkpoint and restart the entire computation (including the scheduler) as a single unit. Is that acceptable? If not, the alternative is to not launch the scheduler under DMTCP control, and instead modify it to launch only the child/worker processes under checkpoint control. I am not sure if that's something you can do in your application.
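
The alternative described above (scheduler outside DMTCP, workers inside) could be sketched roughly as follows. This is only an illustration under several assumptions: the workers are separate scripts started with subprocess rather than multiprocessing.Pool workers (Pool creates its workers from the scheduler process itself, so they cannot be wrapped in dmtcp_launch individually), the port, checkpoint directory and function names are hypothetical, and the command-line flags shown (-p, --ckptdir, --bcheckpoint, --kill) should be checked against the --help output of your installed DMTCP version. The scheduler would also need some way to exchange jobs and results with the workers that survives a restart, which is not shown here.

```python
import glob
import os
import subprocess
import sys

# Hypothetical values for this sketch; adjust to your setup.
COORD_PORT = 7779      # port of the DMTCP coordinator shared by the workers
CKPT_DIR = "./ckpt"    # directory where checkpoint images are written


def launch_cmd(worker_script, port=COORD_PORT, ckpt_dir=CKPT_DIR):
    """Command that starts one worker under DMTCP control."""
    return ["dmtcp_launch", "-p", str(port), "--ckptdir", ckpt_dir,
            sys.executable, worker_script]


def checkpoint_cmd(port=COORD_PORT):
    """Command that checkpoints every process attached to the coordinator
    and blocks until the checkpoint is complete."""
    return ["dmtcp_command", "-p", str(port), "--bcheckpoint"]


def kill_cmd(port=COORD_PORT):
    """Command that kills every process attached to the coordinator."""
    return ["dmtcp_command", "-p", str(port), "--kill"]


def restart_cmd(images, port=COORD_PORT):
    """Command that restarts the workers from their checkpoint images."""
    return ["dmtcp_restart", "-p", str(port)] + list(images)


def suspend_workers(port=COORD_PORT):
    """Checkpoint the workers, then remove them to free CPU and memory."""
    subprocess.run(checkpoint_cmd(port), check=True)
    subprocess.run(kill_cmd(port), check=True)


def resume_workers(port=COORD_PORT, ckpt_dir=CKPT_DIR):
    """Restart the workers from the images found in ckpt_dir."""
    images = sorted(glob.glob(os.path.join(ckpt_dir, "ckpt_*.dmtcp")))
    if not images:
        raise RuntimeError("no checkpoint images found in " + ckpt_dir)
    return subprocess.Popen(restart_cmd(images, port))
```

With helpers like these, the scheduler (itself started as a plain Python process) would call suspend_workers() when a big job arrives, run that job, and then call resume_workers().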