linux mpi slurm nco sbatch

srun used in a loop: srun: Job step aborted: Waiting up to 32 seconds for job step to finish


I have a .sh file that I run with srun because I want to watch the script's output live. But when I run srun job_spinup.sh southfr_exp 1 & the job is always killed with a time-limit error after about two iterations of the main loop. I want to run a 12-month model and repeat it 20 times (a so-called spin-up of 20 cycles), but the error occurs in November of the second loop. Here is the main code in job_spinup.sh:

#!/bin/bash
#SBATCH -J spinup
#SBATCH -p knl_cache
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 10:00:00
#SBATCH -o spinup.log
#SBATCH -e spinup.log
#=========================================================================
# USAGE
#   nohup ./job_spinup DOM[:EXP] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM] &
#
# by default: EXP=spinup, N=20, START_ID=0, START_MM=1
#=========================================================================
#set -x
#
if [ $# -lt 2 ]; then
  echo "Usage: $0 DOM[:EXP:VERSION] nodes[:tasks_per_node:tasks_for_trip N START_ID:START_MM]"
  echo "DOM            = the name of a domain"
  echo "EXP            = the name of an experiment"
  echo "N              = the number of runnings"
  echo "START_ID       = start id of a running"
  echo "START_MM       = start month of a running"
  exit
fi

DOM=`echo $1 | awk '{split($1, f, ":"); print f[1]}'`
EXP=`echo $1 | awk '{split($1, f, ":"); print f[2]}'`
EXP=${EXP:-spinup}
VERSION=`echo $1 | awk '{split($1, f, ":"); print f[3]}'`
VERSION=${VERSION:--X0}
num_nodes=`echo ${2} | awk '{split($1, f, ":"); print f[1]}'`
tasks_per_node=`echo ${2} | awk '{split($1, f, ":"); print f[2]}'`
tasks_per_node=${tasks_per_node:-40}
tasks_for_trip=`echo ${2} | awk '{split($1, f, ":"); print f[3]}'`
tasks_for_trip=${tasks_for_trip:-1}
SPINUP_N=${3:-20}
START_ID=`echo $4 | awk '{split($1, f, ":"); print f[1]}'`
START_ID=${START_ID:-0}
START_MM=`echo $4 | awk '{split($1, f, ":"); print f[2]}'`
START_MM=${START_MM:-1}

# source ~/anaconda3/etc/profile.d/conda.sh
source $(conda info --base)/etc/profile.d/conda.sh
conda activate myenv
echo "***************************************"
echo " CONDA ENV ACTIVATED FOR NCO COMMAND"
echo "***************************************"
echo $SPINUP_N
#
# check if TRIP is used
LTRIP=`grep "LOASIS *= *T" OPTIONS/OPTIONS.nam | wc -l`
#
ulimit -s unlimited
ulimit -n 500000
ulimit -u 64000
unset I_MPI_PMI_LIBRARY
export OMP_NUM_THREADS=1
export DR_HOOK=0
export DR_HOOK_OPT=prof

...

YYYY=${YYYYMMDDHH::4}
MM=${YYYYMMDDHH:4:2}
j=$START_ID
while [ $j -lt $SPINUP_N ] ; do

  echo " "
  echo "------------------"
  echo "SPINUP : $j / $SPINUP_N"

  while [ $MM -le 12 ] ; do
    if [ $LTRIP -eq 1 ]; then
      mpirun -np $((SLURM_NTASKS - tasks_for_trip)) offline.exe : -np $tasks_for_trip trip.exe &> offline
    else
      #echo ${SLURM_NTASKS}
      #mpirun -np ${SLURM_NTASKS} offline.exe &> offline
      #srun -n 1 offline.exe &> offline
      offline.exe &> offline
    fi
....

# Change dates to start again
    if [ $MM -eq 12 ]; then
      ncap2 -O -s "'DTCUR-YEAR'=$YYYY;'DTCUR-MONTH'=1;'DTCUR-DAY'=1;'DTCUR-TIME'=0" PREP.nc PREP.nc
      [ $LTRIP -eq 1 ] && ncap2 -O -s "date(:)={$YYYY,1,1,0}" TRIP_PREP.nc TRIP_PREP.nc
    fi

...


  done

  echo '------------------'
  echo ' '

  MM=01
  j=$(( j+1 ))

done
...
# end simulation
date >> date_$EXP
echo "***************************************"
echo "   SPINUP ENDS CORRECTLY"
echo "***************************************"

conda deactivate
echo "***************************************"
echo "   CONDA ENV DEACTIVATED"
echo "***************************************"

The output looks like this:

(base) [xushan@int2 southfr_exp]$ srun job_spinup.sh southfr_exp 1 &
[1] 11570
(base) [xushan@int2 southfr_exp]$ srun: job 8860513 queued and waiting for resources
srun: job 8860513 has been allocated resources
***************************************
 CONDA ENV ACTIVATED FOR NCO COMMAND
***************************************
20
./job_spinup.sh: line 62: ulimit: open files: cannot modify limit: Operation not permitted
***************************************
   READY TO START SPINUP on tcn991.bullx
     spinup 20 0:1
***************************************
 
------------------
SPINUP : 0 / 20
    199601
1
    199602
1
    199603
1
    199604
1
    199605
1
    199606
1
    199607
1
    199608
1
    199609
1
    199610
1
    199611
1
    199612
1
------------------
 
 
------------------
SPINUP : 1 / 20
    199601
1
    199602
1
    199603
1
    199604
1
    199605
1
    199606
1
    199607
1
    199608
1
    199609
1
    199610
1
srun: Force Terminated job 8860513
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 8860513.0 ON tcn991 CANCELLED AT 2020-09-07T12:51:24 DUE TO TIME LIMIT ***
srun: error: tcn991: task 0: Terminated
srun: Terminating job step 8860513.0

Can anyone help me? Thanks a lot! I am a beginner with Slurm. Is it because I activated a conda environment? Also, according to squeue the job lasts only 5 minutes, and I have no idea why. Is it because of offline.exe?


Solution

  • srun does not read job scripts the way sbatch does, so every #SBATCH option in your script is ignored, including the 10-hour time limit (-t 10:00:00). The job therefore runs in the default partition with the default time limit, which appears to be just long enough for about two loops.

    There are multiple ways to solve it:

    1. Submit with sbatch and follow the output file as it is written (tail -f spinup.log)
    2. Submit with sbatch and attach to the job with sattach
    3. Pass the #SBATCH options to srun as command-line parameters
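
    For option 3, the options can be copied over by hand, or collected from the script itself. The following is only a minimal sketch: sbatch_opts is a hypothetical helper name (not a Slurm command), the demo file is a throwaway copy of the directives from the question, and the command is printed rather than executed so the sketch runs on a machine without Slurm. The -o and -e lines are left out on purpose, on the assumption that you want the output in the terminal rather than in spinup.log.

```shell
#!/bin/bash
# Sketch: collect the "#SBATCH" options from a job script so they can be
# passed to srun explicitly, since srun does not read them from the file.
# sbatch_opts is a hypothetical helper name, not part of Slurm.
sbatch_opts() {
  # Print whatever follows "#SBATCH" on each directive line, one per line.
  sed -n 's/^#SBATCH[[:space:]]\{1,\}//p' "$1"
}

# Throwaway demo script carrying the directives from the question
# (without -o/-e, so that output stays on the terminal).
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH -J spinup
#SBATCH -p knl_cache
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 10:00:00
EOF

# Join the options with spaces and build the equivalent srun command line.
opts=$(sbatch_opts "$demo" | tr '\n' ' ')
cmd="srun ${opts}job_spinup.sh southfr_exp 1"

# Only print the command; run it yourself on the cluster.
echo "$cmd"
rm -f "$demo"
```

    The printed line is the srun call with the partition and the 10-hour limit given explicitly. If live output is not essential, sbatch job_spinup.sh southfr_exp 1 followed by tail -f spinup.log (option 1) needs no changes to the script at all.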