Tags: bash, shell, pbs

Using Portable Batch System (PBS) Arrays To Work On Different Files Concurrently


I am trying to use PBS arrays to run the same program on five different files in parallel. PBS starts five copies of the script, each with a different integer in the PBS_ARRAYID variable. The script is submitted with: qsub script.pbs

My current code is below. It works as-is, but every batch process builds the list of files twice: once to pick the input and once to derive the output name. Is there a more efficient way to do this?

#PBS -S /bin/bash
#PBS -t 1-5       #Makes the $PBS_ARRAYID have the integer values 1-5
#PBS -V

workdir="/user/test"

samtools sort `find ${workdir}/*.bam | sed ${PBS_ARRAYID}'!d'` > `find ${workdir}/*.bam | sed ${PBS_ARRAYID}'!d' | sed "s/.bam/.sorted.bam/"`

Solution

  • #PBS -S /bin/bash
    #PBS -t 0-4       #Makes the $PBS_ARRAYID have the integer values 0-4
    #PBS -V
    
    workdir="/user/test"
    
    files=( "$workdir"/*.bam )       # Expand the glob, store it in an array
    infile="${files[$PBS_ARRAYID]}"  # Pick one item from that array
    
    exec samtools sort "$infile" >"${infile%.bam}.sorted.bam"
    

    Note:

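    • files=( "$workdir"/*.bam ) performs a glob internal to bash (no external find or ls needed) and stores the results of that glob in an array for reuse. Bash sorts glob results, so every array task sees the files in the same order as long as the directory contents and locale do not change between tasks.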
    • Arrays are zero-indexed; thus, we're using 0-4 instead of 1-5.
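    • Command substitutions -- `...` or $(...) -- each fork a subshell, and the original pipelines additionally start find and sed on every run. Where a glob or a parameter expansion such as ${infile%.bam} does the same job inside the shell, the substitution is best avoided.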
    • Using exec for the last command in the script tells the shell interpreter it can replace itself with that command, rather than needing to remain in memory.
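
The points above can be tried without a scheduler. What follows is only a minimal sketch, not part of the original answer: select_input is a hypothetical helper name, the loop sets PBS_ARRAYID by hand to stand in for the array tasks, and nullglob plus the bounds check are added assumptions so that an ID without a matching file is skipped rather than passed to samtools as an empty name. The mktemp call is there only to build throwaway demo files.

```shell
#!/bin/bash
# select_input DIR ID (hypothetical helper, not from the original answer):
# sets the global variable "infile" to the ID-th *.bam file in DIR (zero-based).
# Returns non-zero and leaves "infile" empty if ID is not a valid index.
select_input() {
    local dir=$1 id=$2 files
    infile=
    shopt -s nullglob                   # unmatched glob expands to nothing, not to the pattern itself
    files=( "$dir"/*.bam )              # glob inside bash; results come back sorted
    shopt -u nullglob
    [[ $id =~ ^[0-9]+$ ]] || return 1   # reject empty or non-numeric IDs
    id=$(( 10#$id ))                    # force base 10 so that "08" is not read as octal
    (( id < ${#files[@]} )) || return 1 # reject IDs past the end of the array
    infile=${files[id]}
}

# Dry run: three empty stand-in files, four simulated array tasks.
demo=$(mktemp -d)
touch "$demo"/a.bam "$demo"/b.bam "$demo"/c.bam

for PBS_ARRAYID in 0 1 2 3; do
    if select_input "$demo" "$PBS_ARRAYID"; then
        outfile="${infile%.bam}.sorted.bam"   # strip the suffix without calling sed
        echo "task $PBS_ARRAYID: ${infile##*/} -> ${outfile##*/}"
    else
        echo "task $PBS_ARRAYID: no input file, skipping"
    fi
done

rm -rf "$demo"
```

Run as written, this prints one line per simulated task: a.bam, b.bam and c.bam are mapped to a.sorted.bam, b.sorted.bam and c.sorted.bam for tasks 0 to 2, and task 3 reports that it has no input file. In a real job script the echo would be replaced by the exec samtools sort line from the answer.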