I am trying to use PBS arrays to submit 5 jobs in parallel, running the same program on different files. PBS starts five separate copies of the script, each with a different integer in the PBS_ARRAYID variable. The script is submitted with: qsub script.pbs
My current code is below. It works as-is, but it recomputes the list of files several times in each batch process. Is there a more efficient way to do this?
#PBS -S /bin/bash
#PBS -t 1-5 #Makes the $PBS_ARRAYID have the integer values 1-5
#PBS -V
workdir="/user/test"
samtools sort `find ${workdir}/*.bam | sed ${PBS_ARRAYID}'!d'` > `find ${workdir}/*.bam | sed ${PBS_ARRAYID}'!d' | sed "s/.bam/.sorted.bam/"`
Yes: expand the glob once, store the results in a bash array, and index into it with $PBS_ARRAYID:
#PBS -S /bin/bash
#PBS -t 0-4 #Makes $PBS_ARRAYID take the integer values 0-4 (zero-based, to match array indexing)
#PBS -V
workdir="/user/test"
files=( "$workdir"/*.bam )      # Expand the glob once, store it in an array
infile="${files[$PBS_ARRAYID]}" # Pick one item from that array
exec samtools sort "$infile" >"${infile%.bam}.sorted.bam"
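As a small defensive addition (an assumption on my part, not part of the script above): if the -t range can be wider than the number of .bam files, or the glob might match nothing, a guard placed before the exec line makes the failure explicit instead of handing samtools an empty or literal-pattern argument.
# Hypothetical guard, not in the original answer; place before the exec line
if [ ! -e "$infile" ]; then
    echo "PBS_ARRAYID=$PBS_ARRAYID: no matching .bam file in $workdir" >&2
    exit 1
fi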
Note:
files=( "$workdir"/*.bam ) performs a glob internal to bash (no ls needed) and stores the results of that glob in an array for reuse.
Command substitution -- `...`, or $(...) -- has significant performance overhead, and is best avoided.
Using exec for the last command in the script tells the shell interpreter it can replace itself with that command, rather than needing to remain in memory.
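To see these pieces in isolation, here is a minimal sketch at an interactive shell, assuming /user/test contains exactly two hypothetical files, a.bam and b.bam:
files=( /user/test/*.bam )         # glob expands to (/user/test/a.bam /user/test/b.bam)
echo "${#files[@]}"                # number of matches: 2
infile=${files[1]}                 # zero-based index 1 picks /user/test/b.bam
echo "${infile%.bam}.sorted.bam"   # %.bam strips the suffix: /user/test/b.sorted.bam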