I want to run a program that creates a checkpoint file, and then run several variant configurations that all start from that checkpoint.
For example, if I run:
sbatch -n 1 -t 12:00:00 --mem=16g program.sh
And program.sh looks like this:
#!/bin/sh
./set_checkpoint
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config1.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config2.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config3.sh
sbatch -n 1 -t 12:00:00 --mem=16g cpt_restore_config4.sh
Does this achieve the desired effect?
In general, this is not needed. You can allocate all the resources you want in the main job script and then use part of them for each specific task with srun. Here is a basic example.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
module load some_module
srun -n 4 -c 2 ./my_program arg1 arg2
srun -n 4 -c 2 ./my_other_program arg1 arg2
Note that we allocated 8 tasks and used 4 of them for each srun step. Here, the two srun steps will run sequentially. To make them run in parallel, you can use this trick.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
srun -n 4 -c 2 ./my_program arg1 arg2 &
srun -n 4 -c 2 ./my_other_program arg1 arg2 &
wait
Just keep in mind that this might not work in some cases, for example if the background steps together ask for more tasks or CPUs than the job allocation holds. I would also suggest redirecting each step's STDOUT and STDERR to its own file so you can tell the runs apart. Here is a simple example.
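A minimal sketch, building on the parallel script above; the program names and log file names are only illustrative:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
# Send each step's STDOUT and STDERR to its own log file so the runs are easy to tell apart.
srun -n 4 -c 2 ./my_program arg1 arg2 > my_program.log 2>&1 &
srun -n 4 -c 2 ./my_other_program arg1 arg2 > my_other_program.log 2>&1 &
wait
You could also pass srun's --output and --error options instead of using shell redirection.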
Alternatively, if your tasks run a single script with different sets of parameters, I suggest using argument parsing. In Python, I generally use Hydra with its joblib launcher plugin. It gives you parallelism out of the box.
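For instance, a hypothetical sketch: assuming train.py is a Hydra application (a function decorated with @hydra.main) and the hydra-joblib-launcher plugin is installed, a single --multirun sweeps over the variants and the joblib launcher runs them in parallel inside the allocation. The script name and the variant parameter are placeholders; variant is assumed to be a field in train.py's config.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00
# --multirun sweeps over the listed values; hydra/launcher=joblib runs the sweep in parallel with joblib.
srun python train.py --multirun hydra/launcher=joblib variant=1,2,3,4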