Tags: linux, bash, gnu, gnu-parallel

GNU Parallel - command not found


I'm trying to use GNU Parallel together with SLURM to run several instances of the same script with different input parameters. To do that, I allocate 3 nodes via SLURM and then start several jobs on each node via GNU Parallel. Each job runs the Python script and uses just one CPU core.

Also, the scripts are quite memory-heavy, so I need to be able to restart a job if it fails because of insufficient RAM. For that, I resorted to using the --retry-failed and --retries flags.

My problem is that all jobs except the ones on the last node finish with this output:

/bin/bash: 0: command not found
/bin/bash: 1: command not found
/bin/bash: 2: command not found
/bin/bash: 3: command not found
/bin/bash: 4: command not found
/bin/bash: 5: command not found
/bin/bash: 6: command not found

Obviously, my input is somehow misinterpreted, but I have no idea how, as I'm not an experienced user of GNU Parallel.

My jobscript looks like this:

#!/usr/bin/env bash
#SBATCH --job-name job-name
#SBATCH --cpus-per-task=1
#SBATCH --array=0-2

[ -z "$PARALLEL_SEQ" ] && { exec parallel --retry-failed --retries 5 --joblog joblog.txt -a numtasks $0 ; }

TASKS_PER_NODE=`cat numtasks | wc -l`

IDX=$(( ${TASKS_PER_NODE} * ${SLURM_ARRAY_TASK_ID} + ${PARALLEL_SEQ} - 1 ))

mkdir "res-${IDX}"
cd "res-${IDX}"
source ${HOME}/.bashrc
conda activate myenv
cp ../myscript.py .

python3 ./myscript.py ${IDX}
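
For reference, the index arithmetic in the script can be checked on its own. This is a minimal sketch, assuming numtasks has 7 lines (the error output above suggests the values 0 to 6); compute_idx is a hypothetical helper that is not part of the job script.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the IDX formula from the job script.
# $1: lines in numtasks, $2: SLURM_ARRAY_TASK_ID, $3: PARALLEL_SEQ (starts at 1)
compute_idx() {
    echo $(( $1 * $2 + $3 - 1 ))
}

# Assumption: numtasks has 7 lines. Print the first and last index per array task.
for array_id in 0 1 2; do
    echo "array task ${array_id}: $(compute_idx 7 "$array_id" 1) to $(compute_idx 7 "$array_id" 7)"
done
```

Under that assumption the three array tasks cover the indices 0 to 6, 7 to 13 and 14 to 20, so no two jobs share a res-${IDX} directory.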

Solution

  • It is unclear to me what numtasks contains. Is it just a sequence?

    I would use a bash function. To me that is much more readable than conditionally exec'ing $0.

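    #!/usr/bin/env bash
    #SBATCH --job-name job-name
    #SBATCH --cpus-per-task=1
    #SBATCH --array=0-2

    # Runs once per line of numtasks; GNU Parallel sets PARALLEL_SEQ (1, 2, ...).
    doit() {
        TASKS_PER_NODE=$(wc -l < numtasks)

        # Global index across all array tasks
        IDX=$(( TASKS_PER_NODE * SLURM_ARRAY_TASK_ID + PARALLEL_SEQ - 1 ))

        # -p: do not complain if the directory is left over from an earlier attempt
        mkdir -p "res-${IDX}"
        cd "res-${IDX}" || exit 1
        source "${HOME}/.bashrc"
        conda activate myenv
        cp ../myscript.py .

        python3 ./myscript.py "${IDX}"
    }
    # Make the function and the array index visible to the shells started by parallel
    export -f doit
    export SLURM_ARRAY_TASK_ID

    parallel --retry-failed --retries 5 --joblog joblog.txt -a numtasks doit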
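
The export -f line matters because each job runs in a new shell, and a bash function is only visible there if it has been exported. A minimal sketch of that behaviour, using bash -c as a stand-in for parallel; greet and DEMO_TASK_ID are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical function, for illustration only.
greet() {
    echo "hello $1 from task ${DEMO_TASK_ID}"
}
DEMO_TASK_ID=2   # hypothetical stand-in for SLURM_ARRAY_TASK_ID

# Without these two exports the child shell would report
# "greet: command not found" and see an empty DEMO_TASK_ID.
export -f greet
export DEMO_TASK_ID

# bash -c starts a new shell, much as each parallel job does.
bash -c 'greet world'
```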
    

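    You might also want to check out the options --memfree and --memsuspend: --memfree only starts another job while at least the given amount of memory is free, and --memsuspend suspends running jobs when available memory runs low.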
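
As a rough sketch of where --memfree would go: the 4G threshold below is an assumption, not a recommendation, and should be chosen from the memory one job actually needs. The remaining options are copied from the command above. The block only assembles and prints the command so it can be inspected without GNU Parallel installed.

```shell
#!/usr/bin/env bash
# Hypothetical threshold; pick it from the observed peak memory of one job.
MEMFREE="4G"

# Same options as in the job script, plus --memfree.
parallel_cmd=(parallel --memfree "$MEMFREE" --retry-failed --retries 5
              --joblog joblog.txt -a numtasks doit)

# Print the command for inspection. In the job script, run it instead
# with: "${parallel_cmd[@]}"
printf '%s\n' "${parallel_cmd[*]}"
```

With --memfree, jobs that would not fit are held back rather than started and left to fail, which may reduce how often --retries is needed.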