Tags: shell, sh, distributed-computing, pbs

Set number of GPUs in PBS script from command line


I'm invoking a job with qsub myjob.pbs. In there, I have some logic to run my experiments, which includes running torchrun, a distributed utility for pytorch. In that command you can set the number of nodes and number of processes (+gpus) per node. Depending on the availability, I want to be able to invoke qsub with an arbitrary number of GPUs, so that both -l gpus= and torchrun --nproc_per_node= are set depending on the command line argument.

I tried the following:

#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"

torchrun --standalone --nnodes=1 --nproc_per_node=$1  myscript.py

and invoked it like so:

qsub --pass "4" myjob.pbs

but I got the following error: ERROR: -l: gpus: expected valid integer, found '"$1"'. Is there a way to pass the number of GPUs to the script so that the PBS directives can read them?


Solution

  • The problem is that, to your shell, PBS directives are just comments, so they never go through shell expansion; qsub reads them as literal text, which is why the error shows '"$1"' unchanged. This means that $1 will not be expanded in:

    #PBS -l "nodes=1:ppn=12:gpus=$1"
    
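
To see both views of that line, here is a minimal sketch (not part of the original answer). It writes the question's script to a scratch file, prints the directive as raw text, and then runs the script through sh. The helper name pbs_directives and the scratch file are made up for illustration, and grep only approximates how a scheduler scans for directives.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the directive lines exactly as they sit in the file.
pbs_directives() {
    grep '^#PBS' "$1"
}

# Scratch copy of the job script from the question (quoted heredoc, so
# nothing is expanded while writing it).
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/sh
#PBS -l "nodes=1:ppn=12:gpus=$1"
echo "positional argument inside the script: $1"
EOF

# Raw text view: the line still contains a literal $1.
pbs_directives "$job"

# Shell view: the #PBS line is skipped as a comment; $1 is expanded
# only in the real command below it.
sh "$job" 4

rm -f "$job"
```

The first line printed is the directive with $1 still in it; the second shows the value 4, which only the echo command ever sees.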

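    Instead, you can pass -l gpus= on the qsub command line and remove that directive from your PBS script. The job script then takes the process count from an environment variable (nproc_per_node here) rather than from $1. For example: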

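    #!/bin/sh
    #PBS -l ncpus=12
    # The GPU request is no longer a directive; it is given to qsub directly.
    # Abort on errors and on unset variables (e.g. a missing nproc_per_node).
    set -eu
    
    # nproc_per_node arrives through the job environment (qsub -v).
    torchrun \
        --standalone \
        --nnodes=1 \
        --nproc_per_node="${nproc_per_node}" \
        myscript.py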
    

    Then submit the job through a simple wrapper, e.g. run_myjob.sh, which passes the same value to both -l gpus= and -v nproc_per_node=:

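    #!/bin/sh
    set -eu
    
    # $1 is the GPU count: -l requests the resource, and -v exports the
    # same number into the job's environment for torchrun to read.
    qsub \
        -l gpus="$1" \
        -v nproc_per_node="$1" \
        myjob.pbs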
    

    This should let you specify the number of GPUs as a command-line argument:

    sh run_myjob.sh 4
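
If you want the wrapper to reject a bad argument before anything reaches the scheduler, a slightly longer variant might look like the sketch below. The function names is_positive_int and submit_job, and the QSUB override used to preview the command, are assumptions added for illustration; only the qsub options themselves come from the wrapper above.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeed only for a positive integer such as 1, 4 or 16.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;  # empty, or contains a non-digit
        0*) return 1 ;;           # zero, or a leading zero
        *) return 0 ;;
    esac
}

# Hypothetical helper: submit myjob.pbs with the given GPU count.
# QSUB is an assumed override so the command can be previewed (QSUB=echo).
submit_job() {
    if ! is_positive_int "${1:-}"; then
        echo "usage: run_myjob.sh NUM_GPUS (positive integer)" >&2
        return 2
    fi
    "${QSUB:-qsub}" \
        -l gpus="$1" \
        -v nproc_per_node="$1" \
        myjob.pbs
}

# Preview only: prints the arguments qsub would receive.
QSUB=echo submit_job 4
# prints: -l gpus=4 -v nproc_per_node=4 myjob.pbs
```

In a real run_myjob.sh you would replace the last command with submit_job "$@", so that sh run_myjob.sh 4 submits the job and sh run_myjob.sh four fails with a usage message instead of a scheduler error.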