hpc slurm

How to save/record SLURM script's config parameters to the output file?


I'm new to HPC, and to SLURM in particular. Here is an example script that I use to run my Python code:

#!/bin/bash

# Slurm submission script, serial job

#SBATCH --time 48:00:00
#SBATCH --mem 0
#SBATCH --mail-type ALL
#SBATCH --partition gpu_v100
#SBATCH --gres gpu:4
#SBATCH --nodes 4
#SBATCH --ntasks-per-node=4


#SBATCH --output R-%x.%j.out
#SBATCH --error R-%x.%j.err

export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

module load python3-DL/torch/1.6.0-cuda10.1

srun python3 contrastive_module.py \
      --gpus 4 \
      --max_epochs 1024 \
      --batch_size 256 \
      --num_nodes 4 \
      --num_workers 8

Now, every time I run this script using sbatch run.sl, it generates a .err file and a .out file. The only information I can encode in these two filenames is the "run.sl" filename and the job ID. How can I save a copy of all the parameters I set in the script above, both the SLURM settings and the Python arguments, tied to the job ID and the generated .out and .err files?

For example, if I run the script above four times in a row, each time with different parameters, it is not clear from those files which run used which parameters unless I manually keep track of the parameters and job IDs. There should be some way to automate this in SLURM, no?


Solution

  • Add the following two lines at the end of your submission script:

    scontrol show job $SLURM_JOB_ID
    scontrol write batch_script $SLURM_JOB_ID -
    

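    Both commands print to standard output, so the job description and a copy of the submission script end up at the end of the .out file. The trailing "-" tells scontrol write batch_script to print the script to standard output instead of saving it to a file.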
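
As a rough sketch of how the two scontrol lines might be wrapped so that the Python arguments are recorded too: the helper name log_job_config and the "=====" banners are hypothetical, not part of Slurm. The sketch assumes scontrol is on the PATH inside the job and falls back to a short note when it is run outside Slurm.

```shell
#!/bin/bash

# Hypothetical helper (not part of Slurm): prints the job's configuration to
# stdout, which Slurm redirects to the file named by --output.
log_job_config() {
    echo "===== job configuration ====="
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v scontrol >/dev/null 2>&1; then
        # Job description as Slurm sees it (partition, nodes, time limit, ...)
        scontrol show job "$SLURM_JOB_ID"
        # Copy of the submitted batch script; "-" means write to stdout
        scontrol write batch_script "$SLURM_JOB_ID" -
    else
        echo "not running under Slurm; nothing to record"
    fi
    # Record the exact command line passed to this helper
    echo "===== command: $* ====="
}

# Assumed usage: store the Python arguments once (here in the positional
# parameters) so the logged command and the executed command stay identical.
set -- --gpus 4 --max_epochs 1024 --batch_size 256 --num_nodes 4 --num_workers 8
log_job_config python3 contrastive_module.py "$@"
# srun python3 contrastive_module.py "$@"
```

In this sketch the helper is called before srun rather than at the end of the script, so the configuration would still be recorded if the job fails or is killed at the time limit. The srun line is commented out only so the snippet can be tried outside a cluster.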