Tags: slurm, sbatch

Can I call sbatch recursively?


I want to run a program that creates a checkpoint file. Then I want to run several variant configurations that all start from that checkpoint.

For example, if I run:

sbatch -n 1 -t 12:00:00 --mem=16g program.sh

And program.sh looks like this:

#!/bin/sh

./set_checkpoint

sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config1.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config2.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config3.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config4.sh

Does this achieve the desired effect?


Solution

  • In general this is not needed. Calling sbatch from inside a job script does work on most clusters, but you can instead allocate all the resources you want in the main job script and run each specific task with srun. Here is a basic example.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=2
    #SBATCH --time=01:00:00
    
    module load some_module
    srun -n 4 -c 2 ./my_program arg1 arg2
    srun -n 4 -c 2 ./my_other_program arg1 arg2
    

    Note that we requested 8 tasks with 2 CPUs each (16 CPUs in total) and used 4 tasks for each srun step. Here, the two srun steps will run sequentially. To run them in parallel, you can use this trick.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=2
    #SBATCH --time=01:00:00
    
    srun -n 4 -c 2 ./my_program arg1 arg2 &
    srun -n 4 -c 2 ./my_other_program arg1 arg2 &
    
    wait
    

    Just keep in mind that this might not work in all cases. I would suggest using a logger, or at least redirecting the STDOUT and STDERR of each step to its own file.
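
    Here is a simple sketch reusing the layout from above; the log file names are placeholders.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=2
    #SBATCH --time=01:00:00

    # Give each step its own log file so their output does not interleave.
    srun -n 4 -c 2 ./my_program arg1 arg2 > my_program.log 2>&1 &
    srun -n 4 -c 2 ./my_other_program arg1 arg2 > my_other_program.log 2>&1 &

    # Do not let the job exit until both background steps have finished.
    wait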

    Alternatively, if your tasks run a single script with different sets of parameters, I suggest using argument parsing. In Python, I generally use Hydra's joblib launcher plugin, which gives you parallelism out of the box.
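
    Here is a minimal sketch, assuming hydra-core and hydra-joblib-launcher are installed. The script name restore.py and the ./cpt_restore invocation are placeholders for your own restore program.

    # restore.py -- sweep over configurations with Hydra's joblib launcher
    import subprocess

    import hydra
    from omegaconf import DictConfig

    @hydra.main(version_base=None)
    def main(cfg: DictConfig) -> None:
        # Each multirun job restores from the same checkpoint with its own
        # configuration, passed as a command-line override (+config=...).
        subprocess.run(["./cpt_restore", str(cfg.config)], check=True)

    if __name__ == "__main__":
        main()

    Launching all four variants in parallel then looks like:

    python restore.py --multirun +config=1,2,3,4 hydra/launcher=joblib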