Tags: bash, data-science, scientific-computing, pbs, torque

Torque job randomly dying


I am running a Python 3 script in batches of roughly 27 parallel instances, each with a different input size. The results are saved to a results/${size}x${size} folder, and the working directory must also change to that folder so the program can save some images and data there.

This is my PBS script:

#!/bin/bash
#PBS -l nodes=1:ppn=28
#PBS -l mem=16gb
#PBS -l walltime=120:00:00

cd $PBS_O_WORKDIR

mkdir -p results

module purge
module load newmodules/1.0-Lmod  GCC/6.3.0-2.27  OpenMPI/2.0.2
module load Python/3.6.1

j=0
for i in $(seq 2 2 1024); do
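    # wait for the current batch of background runs to finish before starting more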
    if [ "$j" -gt "28" ]; then
        wait;
        j=0;
    fi
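    # run each size in its own results/<i>x<i> directory so the outputs don't collide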
    cd results
    mkdir -p $i"x"$i
    cd $i"x"$i
    time python3 $PBS_O_WORKDIR/model.py $i > result.txt &
    cd $PBS_O_WORKDIR
    ((j++))
done

wait

and these are the logs I am getting from running tracejob:

kill_task: not killing process (pid=142457/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142458 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142460/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142461 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142463/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142464 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142466/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142467 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142469/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142470 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142472/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142473 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142475/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142476 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142478/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142479 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142481/state=Z) with sig 15
02/27/2018 21:42:26.758 M    kill_task: killing pid 142482 task 1 with sig 15
02/27/2018 21:42:26.758 M    kill_task: not killing process (pid=142483/state=Z) with sig 15
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142442/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142445/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142448/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142451/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142454/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142457/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142460/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142463/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142466/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142469/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142472/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142475/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142478/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142481/state=Z) with sig 9
02/27/2018 21:42:26.788 M    kill_task: not killing process (pid=142483/state=Z) with sig 9
02/27/2018 21:42:26.788 M    scan_for_terminated: job 2205476.example.com task 1 terminated, sid=121918
02/27/2018 21:42:26.788 M    job was terminated
02/27/2018 21:42:26.818 M    obit sent to server
02/27/2018 21:42:26.882 M    removed job scrip

The job runs for an hour or so and then dies, seemingly at random. I am not sure why. I have tried increasing the walltime, but that hasn't helped.

Basically, my Python script takes one even number from 2 to 1024 as input, and the instances run in parallel (in batches of roughly 27 to avoid the node crashing or swapping). Can anyone suggest why this is happening?


Solution

  • So I fixed the issue by using GNU parallel. On some servers you may need to load the module like so: module load gnu-parallel

    Then in the PBS script I simply replace the whole for loop with:

    parallel -j28 'python3 model.py {1} > results{1}.txt' ::: $(seq 100 -2 2)
    

    I also had to change the working directory inside my program so the results wouldn't be overwritten; a shell-side sketch of the same idea follows below.
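
    For reference, here is a minimal sketch of what the whole PBS script could look like with GNU parallel handling the per-size directories on the shell side instead of inside model.py. It is only a sketch under a few assumptions, not my exact script: it assumes the same 2 to 1024 sweep as the question, that model.py writes its images and data into the current working directory, and that the module really is called gnu-parallel on your cluster.

    #!/bin/bash
    #PBS -l nodes=1:ppn=28
    #PBS -l mem=16gb
    #PBS -l walltime=120:00:00

    cd "$PBS_O_WORKDIR"

    module purge
    module load newmodules/1.0-Lmod GCC/6.3.0-2.27 OpenMPI/2.0.2
    module load Python/3.6.1
    module load gnu-parallel   # module name is an assumption; it varies between clusters

    # One model.py run per even size, at most 28 at a time. Each run works in
    # its own results/<size>x<size> directory, so outputs never collide and
    # model.py does not have to change its own working directory.
    parallel -j28 '
        mkdir -p results/{1}x{1} &&
        cd results/{1}x{1} &&
        time python3 "$PBS_O_WORKDIR"/model.py {1} > result.txt
    ' ::: $(seq 2 2 1024)

    The cd inside the quoted command only affects that one job, so the runs never interfere with each other, and the whole sweep still respects the -j28 limit matching ppn=28.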