I'm trying to run a command on roughly 10,000 files. The command does not require any information from other files, and each invocation deposits its output in a separate txt file. Currently, I am using a loop, but I would like to find out whether there is a way to run the commands in parallel.
I can increase the number of nodes I'm using, the number of tasks per node, the CPUs per task, the allocated memory, and the time allowed for the loop to complete.
What would be the best way to run the command in parallel as much as the system can handle?
Here is my loop:
for file in *.ctl; do
/N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml "${file}"
done
Thank you for your time,
-tbiewerh
"as much as the system can handle" is the difficult part here. This depends on the available resources and on how your jobs consume them. Assuming you can yourself identify a maximum number of concurrent jobs, say 12, there are several options:
find and GNU xargs
find . -maxdepth 1 -type f -name '*.ctl' -print0 |
xargs -0 -n1 -P12 /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
find lists all the ctl files separated by a NUL character (-print0), and the list is piped to GNU xargs, which also treats its input as NUL-separated (-0), consumes one input argument per job (-n1), and runs up to 12 jobs in parallel (-P12). Using NUL as a separator allows arbitrary file names, including names containing newline characters.
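As a quick sanity check, here is a minimal, self-contained sketch of the same pipeline in a throwaway directory, with printf standing in for codeml, and with -P set from nproc so the job count matches the available CPUs (nproc is from GNU coreutils; the file names here are invented for the demo):

```shell
# Sketch of the find | xargs pipeline; printf plays the role of codeml.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch one.ctl two.ctl three.ctl

# -P"$(nproc)" scales the parallelism to the number of available CPUs.
# xargs appends one file name per invocation, so it arrives as "$1".
find . -maxdepth 1 -type f -name '*.ctl' -print0 |
  xargs -0 -n1 -P"$(nproc)" sh -c 'printf "done %s\n" "$1"' sh > log.txt

wc -l < log.txt   # one line per .ctl file
```

Replacing the sh -c stand-in with the full codeml path gives the real command shown above.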
find and GNU parallel
find . -maxdepth 1 -type f -name '*.ctl' -print0 |
parallel -0 -n1 -P12 /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
That's exactly the same as with xargs. Note that GNU parallel can determine the optimal number of jobs during a warm-up period, but the efficiency strongly depends on the use case. Try -P0 instead of -P12 and see whether it works better than a fixed number. Note also that GNU parallel can run jobs on several computers. See the manual for the details.
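For the multi-machine case, a hedged sketch of what that invocation could look like (node1 and node2 are placeholder host names; this assumes passwordless SSH to those nodes, GNU parallel installed on each of them, and a shared filesystem so the ctl files are visible everywhere, as is typical on an HPC cluster):

```shell
# 4/node1 means "up to 4 jobs on node1"; ':' adds the local machine.
# --workdir . makes remote jobs run in the same (shared) directory.
find . -maxdepth 1 -type f -name '*.ctl' -print0 |
  parallel -0 -n1 --sshlogin 4/node1,4/node2,: --workdir . \
    /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
```

Without a shared filesystem you would additionally need parallel's file-transfer options; see the GNU parallel manual.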
make
Use this only if your file names do not contain spaces. Create a file named Makefile in your source directory (the one containing the ctl files) with the following content (replace the leading spaces before $(CODEML) $< with a tab):
# Makefile
CODEML := /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
CTLS := $(wildcard *.ctl)
JOBS := $(patsubst %,%-job,$(CTLS))
.PHONY: all $(JOBS)
.DEFAULT_GOAL := all
all: $(JOBS)
$(JOBS): %-job: %
$(CODEML) $<
Type make -j12. make will run up to 12 jobs in parallel until all jobs complete. Using GNU make may look needlessly complicated compared to the other solutions, but there is an interesting benefit: you can ask make to run codeml only if the ctl file changed since the last run. make does this by comparing the last modification times of a result file and its source file.
Assume codeml foo.ctl produces the file foo.mlb, and modify the Makefile as follows (again, replace the leading spaces before $(CODEML) $< with a tab):
# Makefile
CODEML := /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
CTLS := $(wildcard *.ctl)
MLBS := $(patsubst %.ctl,%.mlb,$(CTLS))
.PHONY: all
.DEFAULT_GOAL := all
all: $(MLBS)
%.mlb: %.ctl
$(CODEML) $<
Then, if you type make -j12, make will run at most 12 jobs simultaneously, but it will only run the jobs for which the mlb file is older than the ctl file. This way, if you modify only 100 of your ctl files and run make -j12 again, it should take far less time than the first run, because only 100 jobs will be launched instead of 10000.
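The incremental behavior can be demonstrated in a throwaway directory with cp standing in for codeml (the pattern rule has the same shape as the second Makefile; all file names here are invented for the demo):

```shell
# Demonstrate make's mtime-based skipping with cp standing in for codeml.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same rule shape as the second Makefile; note the recipe line must
# start with a tab, written here as printf's \t escape.
{
  printf 'CTLS := $(wildcard *.ctl)\n'
  printf 'MLBS := $(patsubst %%.ctl,%%.mlb,$(CTLS))\n'
  printf 'all: $(MLBS)\n'
  printf '%%.mlb: %%.ctl\n'
  printf '\tcp $< $@\n'
} > Makefile

touch a.ctl b.ctl
make -j2 >/dev/null        # first run: builds a.mlb and b.mlb

sleep 1                    # ensure the new mtime is strictly newer
touch a.ctl                # "modify" one source file
second_run=$(make -j2)     # second run: rebuilds a.mlb only
printf '%s\n' "$second_run"
```

The second run's output mentions only a.mlb, because b.mlb is already newer than b.ctl and is therefore skipped.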