
HPC SLURM and batch calls to MPI-enabled application in Master-Worker system


I am trying to implement some sort of Master-Worker system on an HPC cluster with the resource manager SLURM, and I am looking for advice on how to implement such a system.

I have to use some Python code that plays the role of the Master, in the sense that between batches of calculations the Master runs 2 seconds of its own calculations before sending a new batch of work to the Workers. Each Worker must run an external executable on a single node of the HPC. The external executable (Gromacs) is itself MPI-enabled. There will be ~25 Workers and many batches of calculations.

What I have in mind at the moment (also see the EDIT further below):

[workflow diagram]

What I'm currently trying:

  1. Allocate via SLURM as many MPI tasks as the number of nodes I want to use, within a bash script that I submit via sbatch run.sh
#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12

module load required_env_module_for_external_executable

srun python my_python_code.py
  2. Catch within my_python_code.py the current MPI rank, and use rank/node 0 to run the Master Python code (a rough sketch of the Master function is given after step 3)
from mpi4py import MPI

name = MPI.Get_processor_name()
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

if rank == 0:  # Master
    run_initialization_and_distribute_work_to_Workers()

else:  # Workers
    start_Worker_waiting_for_work()
  3. Within the Python code of the Workers, start the external (MPI-enabled) application using MPI.COMM_SELF.Spawn()
def start_Worker_waiting_for_work():

    # here we are on a single node

    executable = 'gmx_mpi'
    # Spawn expects the arguments as a sequence of strings
    exec_args = ['mdrun', '-deffnm', 'calculation_n']

    # create some relationship between current MPI rank
    # and the one the executable should use ?

    mpi_info = MPI.Info.Create()
    mpi_info.Set('host',  MPI.Get_processor_name())
    commspawn = MPI.COMM_SELF.Spawn(executable, args=exec_args,
                                    maxprocs=1, info=mpi_info)

    commspawn.Barrier()
    commspawn.Disconnect()

    res_analysis = do_some_analysis()  # check what the executable produced

    return res_analysis
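
For completeness, here is roughly what I have in mind for the Master side, in the same my_python_code.py. This is only a sketch: the exact send/receive protocol is an assumption on my part, and build_initial_batches(), do_master_calculations() and prepare_next_batches() are placeholders.

def run_initialization_and_distribute_work_to_Workers():

    # relies on the 'from mpi4py import MPI' at the top of my_python_code.py
    comm = MPI.COMM_WORLD
    n_workers = comm.Get_size() - 1

    batches = build_initial_batches()  # placeholder: prepare the first tasks

    while batches:
        # hand out one task per Worker (assumes one task per Worker per batch,
        # which is my case: ~25 tasks for ~25 Workers)
        for worker in range(1, n_workers + 1):
            comm.send(batches.pop(0), dest=worker, tag=1)

        do_master_calculations()  # placeholder: the ~2 s of Master-side work

        # collect the analysis results returned by the Workers
        results = [comm.recv(source=worker, tag=2)
                   for worker in range(1, n_workers + 1)]

        batches = prepare_next_batches(results)  # placeholder: build the next batch

    # tell the Workers there is no more work
    for worker in range(1, n_workers + 1):
        comm.send(None, dest=worker, tag=1)

The Workers would then need a matching comm.recv() before spawning the executable, and a comm.send() of their analysis result afterwards.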

What I would like some explanations on:

  1. Can someone confirm that this approach seems valid for implementing the desired system? Or is it obvious that it has no chance of working? If so, why?

  2. I am not sure that MPI.COMM_SELF.Spawn() will make the executable inherit the SLURM resource allocation. If not, how can I fix this? I think that MPI.COMM_SELF.Spawn() is what I am looking for, but I am not sure.

  3. The external executable requires some environment modules to be loaded. If they are loaded when sbatch run.sh is executed, are they still loaded when I invoke MPI.COMM_SELF.Spawn() from my_python_code.py?

  4. As a slightly different approach, is it possible to have something like pre-allocations/reservations to book resources for the Workers, and then use MPI.COMM_WORLD.Spawn() together with these pre-allocations/reservations? The goal is also to avoid entering the SLURM queue at each new batch, as this may waste a lot of clock time (hence the wish to book all required resources at the very beginning).

  5. Since the Python Master has to stay alive the whole time anyway, SLURM job dependencies cannot be of use here, can they?

Thank you so much for any help you may provide!

EDIT: Simplification of the workflow

In an attempt to keep my question simple, I first omitted the fact that I actually had the Workers doing some analysis. But this work can be done on the Master using OpenMP multiprocessing, as Gilles Gouillardet suggested; it executes fast enough.

The Workers are nevertheless necessary, because each task takes about 20-25 min on a single Worker/node.

I also added some bits about maintaining my own queue of tasks to be sent to the SLURM queue and ultimately to the Workers, in case the number of tasks exceeds a few tens or hundreds of jobs. This should also provide some flexibility in the future, when re-using this code for different applications.
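
As a rough illustration of what I mean by my own queue (all_tasks, submit_to_slurm() and jobs_still_queued() are hypothetical placeholders, and MAX_IN_FLIGHT is simply the ~25 nodes I am allowed to occupy at once):

from collections import deque
import time

MAX_IN_FLIGHT = 25              # never keep more than ~25 jobs in the SLURM queue
pending = deque(all_tasks)      # my own queue of tasks
in_flight = []                  # SLURM job ids currently queued or running

while pending or in_flight:
    # top up the SLURM queue without flooding it
    while pending and len(in_flight) < MAX_IN_FLIGHT:
        in_flight.append(submit_to_slurm(pending.popleft()))
    time.sleep(30)
    in_flight = jobs_still_queued(in_flight)  # drop the jobs that have finished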

This is probably fine as it is. I will try to go this way and update these lines. EDIT: It works fine.

[updated workflow diagram]


Solution

  • At first glance, this looks over-convoluted to me:

    • there is no communication between a slave and GROMACS
    • there are some master/slave communications, but is MPI really necessary?
    • are the slaves really necessary? (e.g. can the master process simply serialize the computation and then directly start GROMACS?)

    A much simpler architecture would be to have one process on your frontend that will do the following (a rough sketch is given after the list):

    1. prepare the GROMACS inputs
    2. sbatch gromacs (start several jobs in a row)
    3. wait for the GROMACS jobs to complete
    4. analyze the GROMACS outputs
    5. re-iterate or exit
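
    For instance, a rough sketch of such a frontend loop (gromacs_job.sh, prepare_inputs(), analyze_outputs() and work_remains() are placeholders):

import subprocess
import time

while True:
    tasks = prepare_inputs()                   # 1. prepare the GROMACS inputs (list of task names)
    job_ids = []
    for task in tasks:                         # 2. submit one GROMACS job per task
        out = subprocess.run(['sbatch', '--parsable', 'gromacs_job.sh', task],
                             capture_output=True, text=True, check=True)
        job_ids.append(out.stdout.strip().split(';')[0])
    while True:                                # 3. wait for the GROMACS jobs to complete
        queue = subprocess.run(['squeue', '-h', '-j', ','.join(job_ids)],
                               capture_output=True, text=True)
        if not queue.stdout.strip():           # empty output: no job is queued or running anymore
            break
        time.sleep(30)
    results = analyze_outputs(tasks)           # 4. analyze the GROMACS outputs
    if not work_remains(results):              # 5. re-iterate or exit
        break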

    If the slave is doing some work you do not want to serialize on the master, can you replace the MPI communications with files on a shared filesystem? In that case, you can do the computation on the compute nodes within a GROMACS job, before and after executing GROMACS (see the sketch below). If not, maybe TCP/IP-based communications can do the trick.
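
    For example, the GROMACS job script could simply run a small driver on the compute node that does the pre- and post-processing around GROMACS and leaves its result on the shared filesystem (task_driver.py and the two helper functions are placeholders):

# task_driver.py -- run inside the SLURM job, e.g.: python task_driver.py calculation_n
import json
import subprocess
import sys

name = sys.argv[1]

prepare_gromacs_inputs(name)            # placeholder: pre-processing on the compute node
subprocess.run(['srun', 'gmx_mpi', 'mdrun', '-deffnm', name], check=True)
result = analyze_gromacs_outputs(name)  # placeholder: post-processing on the compute node

# leave the result on the shared filesystem for the frontend process to pick up
with open(name + '.result.json', 'w') as fh:
    json.dump(result, fh)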