I have a slurm job which I launch using batch script, say:
#! /bin/bash -l
#SBATCH --job-name=job1
#SBATCH -o stdout.log
#SBATCH -e stderr.log
#SBATCH --ntasks=160
cd $WORK/job1
mpirun ./mympitask # 1.)
./collect_results # 2.) long-running sequential task.
The first step (1.) runs in parallel using MPI; the second step (2.) only needs a single task, and the remaining tasks should be released so that I do not occupy them or waste CPU time.
Is it possible to, for example:
a) release all tasks except one and run the final step on a single CPU?
b) specify a command that should be run after the sbatch job is done?
I was thinking about using an salloc call for the last step.
Two options are available with SLURM:
1) Before running the sequential post-processing task, you can run
scontrol update job=$SLURM_JOBID NodeList=`hostname`
to shrink the job to the single node the script is running on. I do not know whether, or how, the job can be shrunk further to a single core.
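For illustration, a minimal sketch of how the original batch script could be adapted, assuming the job is allowed to shrink this way while it is running:

#! /bin/bash -l
#SBATCH --job-name=job1
#SBATCH -o stdout.log
#SBATCH -e stderr.log
#SBATCH --ntasks=160
cd $WORK/job1
mpirun ./mympitask                                   # 1.) parallel MPI step
# Release every node except the one this script runs on,
# so the sequential step does not hold the whole allocation.
scontrol update job=$SLURM_JOBID NodeList=`hostname`
./collect_results                                    # 2.) sequential step, now on one node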
2) Another option is to submit two jobs, with the post-processing job dependent on the MPI job:
sbatch mpijob.slurm
sbatch -d afterok:<mpijob SLURM jobid> postprocessing.slurm
The non-trivial part (this is not rocket science, though) is automatically retrieving the jobid of the first job.
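A minimal sketch of doing that from a submission script, assuming a SLURM version whose sbatch supports the --parsable flag (which prints only the jobid):

# Submit the MPI job and capture its jobid.
jobid=$(sbatch --parsable mpijob.slurm)
# Submit the post-processing job; it starts only if the MPI job finishes successfully.
sbatch -d afterok:$jobid postprocessing.slurm

Without --parsable, the jobid can also be extracted from the usual "Submitted batch job <id>" output, e.g. with awk '{print $4}'.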