I'm new to HPC and to SLURM in particular. Here is an example script that I use to run my Python code:
#!/bin/bash
# Slurm submission script, multi-node GPU job
#SBATCH --time=48:00:00
#SBATCH --mem=0
#SBATCH --mail-type=ALL
#SBATCH --partition=gpu_v100
#SBATCH --gres=gpu:4
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
module load python3-DL/torch/1.6.0-cuda10.1
srun python3 contrastive_module.py \
    --gpus 4 \
    --max_epochs 1024 \
    --batch_size 256 \
    --num_nodes 4 \
    --num_workers 8
Now every time I run this script using sbatch run.sl, it generates a .err and a .out file, but the only things I can encode in those filenames are the job name ("run.sl") and the Job ID. How can I save a copy of all the parameters I set in the script above, both the Slurm options and the Python arguments, tied to the Job ID and to the generated .out and .err files?
For example, if I run the script above 4 times in a row, each time with different parameters, it's not clear which files correspond to which run unless I manually keep track of the parameters and Job IDs. There should be some way to automate this in SLURM, no?
Add the following two lines at the end of your submission script:
scontrol show job $SLURM_JOB_ID
scontrol write batch_script $SLURM_JOB_ID -
This will append the full job description and the submitted batch script to the end of the .out file, so each run's parameters stay tied to its Job ID.
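For example, the tail of run.sl could then look like the sketch below. Factoring the Python arguments into a PY_ARGS variable is my own addition (the name is arbitrary, not a Slurm feature), so that the arguments themselves are echoed into the .out file as well:

# Collect the Python arguments once so they can be both logged
# and passed to srun (PY_ARGS is an illustrative name).
PY_ARGS="--gpus 4 --max_epochs 1024 --batch_size 256 --num_nodes 4 --num_workers 8"
echo "Python arguments: $PY_ARGS"

srun python3 contrastive_module.py $PY_ARGS

# Append the full job description and the submitted batch script
# to this job's .out file.
scontrol show job $SLURM_JOB_ID
scontrol write batch_script $SLURM_JOB_ID -

If you would rather keep a standalone copy of the script per job instead of appending it to the .out file, scontrol write batch_script also accepts a filename in place of "-", e.g. scontrol write batch_script $SLURM_JOB_ID "R-${SLURM_JOB_NAME}.${SLURM_JOB_ID}.sl", which mirrors the R-%x.%j naming pattern of your --output files.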