Tags: bash, parallel-processing

In bash, how would I start another command before the first is finished?


I'm trying to run a command on roughly 10,000 files. The command does not require any information from other files, and each command deposits the output in a separate txt file. Currently, I am using a loop, but am trying to find out if there is a way to run the commands in parallel.

I can increase the number of nodes I'm using, the number of tasks per node, the CPUs per task, the allocated memory, and the time allowed for the loop to complete.

What would be the best way to run the command in parallel as much as the system can handle?

Here is my loop:

for file in *.ctl; do
        /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml "${file}"
done

Thank you for your time,

-tbiewerh


Solution

  • "as much as the system can handle" is the difficult part here. It depends on the available resources and how your jobs consume them. Assuming you can identify a maximum number of concurrent jobs yourself, say 12, there are several options:

    find and GNU xargs

    find . -maxdepth 1 -type f -name '*.ctl' -print0 |
    xargs -0 -n1 -P12 /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
    

    find lists all ctl files separated by a NUL character (-print0) and this is piped to GNU xargs that also considers the inputs as NUL-separated (-0), consumes one input argument per job (-n1) and runs up to 12 jobs in parallel (-P12).

    Using NUL as a separator allows your files to have arbitrary names, including names that contain newline characters.
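    As a quick, self-contained check of the pipeline, here is the same pattern with `touch` standing in for codeml (the `sh -c` wrapper receives each file name as `$1`; the directory and file names are just for the demo):

```shell
#!/usr/bin/env bash
# Demo of the find | xargs -0 -P pattern with a stand-in command.
mkdir -p xargs_demo
touch xargs_demo/a.ctl xargs_demo/b.ctl "xargs_demo/name with spaces.ctl"

# Up to 4 parallel jobs; each sh invocation gets one file name as $1.
find xargs_demo -maxdepth 1 -type f -name '*.ctl' -print0 |
    xargs -0 -n1 -P4 sh -c 'touch "$1.done"' _

ls xargs_demo/*.done | wc -l    # prints 3
```

    In the real setup you would replace the `sh -c 'touch …'` stand-in with the codeml path from the answer; the NUL separation keeps the file with spaces intact.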

    find and GNU parallel

    find . -maxdepth 1 -type f -name '*.ctl' -print0 |
    parallel -0 -n1 -P12 /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
    

    That's exactly the same as with xargs. Note that GNU parallel can determine the optimal number of jobs during a warm-up period, but the efficiency strongly depends on your use case: try -P0 instead of -P12 and see whether it works better than a fixed number. Note also that GNU parallel can run jobs on several computers; see the manual for details.

    GNU make

    Use this only if your file names do not contain spaces. Create a file named Makefile in your source directory (the one containing the ctl files) with the following content (the recipe line $(CODEML) $< must be indented with a tab, not spaces):

    # Makefile
    CODEML := /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
    CTLS   := $(wildcard *.ctl)
    JOBS   := $(patsubst %,%-job,$(CTLS))
    
    .PHONY: all $(JOBS)
    .DEFAULT_GOAL := all
    all: $(JOBS)
    $(JOBS): %-job: %
        $(CODEML) $<
    

    Type make -j12. make will run up to 12 jobs in parallel until all of them complete. Using GNU make may look needlessly complicated compared to the other solutions, but it has an interesting benefit: you can ask make to run codeml only if the ctl file changed since the last run. make does this by comparing the last-modification times of a result file and its source file.

    Assume codeml foo.ctl produces a file foo.mlb, and you modify the Makefile as follows (again, the recipe line $(CODEML) $< must be indented with a tab):

    # Makefile
    CODEML := /N/project/tomWGS/OVULERNASEQ/scripts/moEvoScripts/Programs/paml/bin/codeml
    CTLS   := $(wildcard *.ctl)
    MLBS   := $(patsubst %.ctl,%.mlb,$(CTLS))
    
    .PHONY: all
    .DEFAULT_GOAL := all
    all: $(MLBS)
    %.mlb: %.ctl
        $(CODEML) $<
    

    Then, if you type make -j12, make will still run at most 12 jobs simultaneously, but it will only run the jobs for which the mlb file is older than the ctl file. This way, if you modify only 100 of your 10,000 ctl files and run make -j12 again, it should take far less time than the first run, because only 100 jobs will be launched instead of 10,000.
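    Finally, since the question title asks about bash itself: bash's job control can throttle background jobs without any external tool. This is a sketch with a stand-in `process` function in place of the real codeml call; `wait -n` requires bash 4.3 or later.

```shell
#!/usr/bin/env bash
# Throttled background jobs in plain bash (sketch).
# `process` is a stand-in for the real codeml invocation.
process() {
    echo "result for $1" > "bash_demo_$1.txt"
}

max_jobs=4
for i in 1 2 3 4 5 6 7 8; do
    # If the limit is reached, wait for any one running job to finish.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
    process "$i" &
done
wait    # block until the remaining jobs finish

ls bash_demo_*.txt | wc -l    # prints 8
```

    This is essentially what xargs -P does for you, so prefer one of the solutions above unless you need the loop to stay in pure bash.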