I am trying to implement a Master-Worker system on an HPC cluster managed by SLURM, and I am looking for advice on how to implement such a system.
Some Python code plays the role of the Master: between batches of calculations it runs about 2 seconds of its own computation before sending a new batch of work to the Workers. Each Worker must run an external executable on a single node of the cluster. The external executable (Gromacs) is itself MPI-enabled. There will be ~25 Workers and many batches of calculations.
The whole thing would be submitted with

sbatch run.sh

where run.sh would be something like:
#!/bin/bash -l
# 4 nodes, one Python MPI rank per node: rank 0 is the Master, the others are Workers
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
# 12 cores per node are left for the Gromacs run each Worker spawns
#SBATCH --cpus-per-task=12
module load required_env_module_for_external_executable
srun python my_python_code.py
And my_python_code.py would detect the current MPI rank and use rank/node 0 to run the Master python code, while the other ranks run the Workers:

from mpi4py import MPI

name = MPI.Get_processor_name()
rank = MPI.COMM_WORLD.Get_rank()
size = MPI.COMM_WORLD.Get_size()

def start_Worker_waiting_for_work():
    # here we are on a single node
    executable = 'gmx_mpi'
    exec_args = ['mdrun', '-deffnm', 'calculation_n']  # Spawn expects a list of argument strings
    # create some relationship between the current MPI rank
    # and the one the executable should use?
    mpi_info = MPI.Info.Create()
    mpi_info.Set('host', MPI.Get_processor_name())
    commspawn = MPI.COMM_SELF.Spawn(executable, args=exec_args,
                                    maxprocs=1, info=mpi_info)
    commspawn.Barrier()
    commspawn.Disconnect()
    res_analysis = do_some_analysis()  # check what the executable produced
    return res_analysis

if rank == 0:  # Master
    run_initialization_and_distribute_work_to_Workers()
else:  # Workers
    start_Worker_waiting_for_work()
Can someone confirm that this approach is valid for implementing the desired system? Or is it obvious that it has no chance of working? If so, why?
I am not sure that MPI.COMM_SELF.Spawn() will make the executable inherit the SLURM resource allocation. If not, how can I fix this? I think MPI.COMM_SELF.Spawn() is what I am looking for, but I am not sure.
The external executable requires some environment modules to be loaded. If they are loaded in run.sh at submission time, are they still loaded when the executable is launched via MPI.COMM_SELF.Spawn() from my_python_code.py?
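For what it's worth, here is the kind of sanity check I plan to run from each Worker before attempting the Spawn (check_worker_environment is just a name I made up); it only verifies that gmx_mpi and the module environment are actually visible from the Python process that will do the spawning:

import os
import shutil

def check_worker_environment(executable='gmx_mpi'):
    """Sanity check: is the module environment loaded in run.sh visible in this Worker?"""
    path = shutil.which(executable)                 # is the Gromacs binary on PATH?
    if path is None:
        raise RuntimeError(f'{executable} not found on PATH; '
                           'the environment module may not have propagated')
    print('found executable at', path)
    print('LD_LIBRARY_PATH =', os.environ.get('LD_LIBRARY_PATH', '<unset>'))
    return path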
As a slightly different approach, is it possible to have something like pre-allocations/reservations to book resources for the Workers, and then use MPI.COMM_WORLD.Spawn() together with those pre-allocations/reservations? The goal is also to avoid entering the SLURM queue at each new batch, as this may waste a lot of wall-clock time (hence the wish to book all required resources at the very beginning).
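To make this concrete, here is a rough sketch of what I have in mind for dispatching batches without ever leaving the initial allocation; prepare_batch and run_gromacs_and_analyse are placeholders I made up, and the real Worker would launch Gromacs as shown above:

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

def run_initialization_and_distribute_work_to_Workers(number_of_batches):
    # Master: stays inside the single SLURM allocation for the whole run
    all_results = []
    for batch in range(number_of_batches):
        tasks = prepare_batch(batch)              # hypothetical: builds one task per Worker
        for worker in range(1, size):
            comm.send(tasks[worker - 1], dest=worker, tag=1)
        # the ~2 s of the Master's own computation can happen here
        all_results.append([comm.recv(source=worker, tag=1)
                            for worker in range(1, size)])
    for worker in range(1, size):                 # shut the Workers down at the end
        comm.send(None, dest=worker, tag=1)
    return all_results

def start_Worker_waiting_for_work():
    while True:
        task = comm.recv(source=0, tag=1)
        if task is None:                          # None is the shutdown signal
            break
        result = run_gromacs_and_analyse(task)    # hypothetical: spawn/srun Gromacs here
        comm.send(result, dest=0, tag=1)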
Since the Python Master has to stay alive the whole time anyway, SLURM job dependencies cannot be useful here, can they?
Thank you so much for any help you can provide!
In an attempt to keep my question simple, I first omitted the fact that I actually had the Workers doing some analysis. But this work can be done on the Master using OpenMP multiprocessing, as Gilles Gouillardet suggested; it executes fast enough.
The Workers are still necessary, though, because each task takes about 20-25 min on a single Worker/node.
I also added some logic to maintain my own queue of tasks to be sent to the SLURM queue and ultimately to the Workers, in case the number of tasks exceeds a few tens or hundreds of jobs. This should also provide some flexibility in the future, when reusing this code for different applications.
This is probably fine as it is. I will try to go this way and update these lines. EDIT: It works fine.
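In case it helps someone, here is a simplified sketch of how such a Master-side task queue could look (all names are made up, and it assumes a Worker loop like the one sketched earlier, which treats a None task as the shutdown signal):

from collections import deque
from mpi4py import MPI

comm = MPI.COMM_WORLD

def master_with_task_queue(all_tasks, n_workers):
    """Hand out tasks one by one as Workers become idle."""
    pending = deque(all_tasks)                      # my own queue, independent of the SLURM queue
    results = []
    active = 0
    for worker in range(1, n_workers + 1):          # prime every Worker with one task
        if pending:
            comm.send(pending.popleft(), dest=worker, tag=1)
            active += 1
    while active:                                   # refill whichever Worker reports back
        status = MPI.Status()
        results.append(comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status))
        active -= 1
        if pending:
            comm.send(pending.popleft(), dest=status.Get_source(), tag=1)
            active += 1
    for worker in range(1, n_workers + 1):          # None is the shutdown signal
        comm.send(None, dest=worker, tag=1)
    return results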
At first glance, this looks over-convoluted to me.
A much simpler architecture would be to have one process on your frontend that would

sbatch gromacs

(starting several jobs in a row).
If the slaves are doing some work that you do not want to serialize on the master, can you replace the MPI communications with files on a shared filesystem? In that case, you can do the computation on the compute nodes within a GROMACS job, before and after executing GROMACS. If not, maybe TCP/IP-based communications can do the trick.
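To illustrate (only a sketch with made-up names, not working code): the frontend process prepares the inputs on the shared filesystem, submits the jobs with sbatch, and then simply polls for marker files instead of using MPI:

import subprocess
import time
from pathlib import Path

SHARED = Path('/shared/project/batches')    # hypothetical directory on the shared filesystem

def run_batch(batch_id, n_jobs):
    """Submit n_jobs Gromacs jobs for one batch and wait for their 'done' marker files."""
    for job in range(n_jobs):
        workdir = SHARED / f'batch_{batch_id}' / f'job_{job}'
        workdir.mkdir(parents=True, exist_ok=True)
        write_gromacs_inputs(workdir)       # hypothetical helper: prepare the Gromacs inputs
        # gromacs_job.sh (hypothetical) runs gmx_mpi and touches 'done' in $1 when finished
        subprocess.run(['sbatch', 'gromacs_job.sh', str(workdir)], check=True)
    while True:                             # files replace the MPI communications
        finished = sum((SHARED / f'batch_{batch_id}' / f'job_{j}' / 'done').exists()
                       for j in range(n_jobs))
        if finished == n_jobs:
            return
        time.sleep(30)

def main(number_of_batches, n_jobs=25):
    for batch_id in range(number_of_batches):
        run_batch(batch_id, n_jobs)
        do_master_work()                    # hypothetical: the ~2 s of Master computation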