bash shell parallel-processing gnu-parallel

Limit spawned parallel processes and exit all upon failure of any

I'm running some tests in parallel by launching processes from a script. Each process prints only to stdout, which is redirected to a file, and exits with 0 if and only if it succeeds (otherwise -1).

If and when a process exits with -1, I print something to its (or a related) output file (namely, the arguments it was called with), kill all other processes, and exit.

I have written a script that uses trap "..." CHLD to run some code whenever a subprocess exits. This works under certain conditions, but the script is not very robust: if I send a keyboard interrupt, the subprocesses sometimes keep running, and sometimes the sheer number of subprocesses overwhelms the machine(s) so that none of them seems to make progress.

I am using this on my quad-core laptop as well as on a cluster of 128 CPUs, over which subprocesses are distributed automatically. How do I run a large number of background subprocesses in a bash script, with only a limited number running concurrently, and do something and then exit if one of them returns a bad exit code? I would also like the script to clean up after a keyboard interrupt. Should I use GNU Parallel, and if so, how?

Here is an MWE of my script so far, which spawns subprocesses without any limit, annotated with what I think each part means. I got the idea to use trap from "shell - get exit code of background process".

$ cat parallel_tests.sh 
#!/bin/bash
# some help from https://stackoverflow.com/questions/1570262/shell-get-exit-code-of-background-process
handle_chld() {
    #echo pids are ${pids[@]}
    local tmp=() ### temporary storage for pids that haven't finished
    # for each pid that hadn't finished since the last trap
    for((i=0;i<${#pids[@]};++i)); do
        # if this pid is still running
        if [[ $(ps -p ${pids[i]} -o pid=) ]]
        then
            tmp+=(${pids[i]}) ### add pid to list of pids that are running
        else
            wait ${pids[i]} ### put the exit code of this pid into $?
            if [ "$?" != "0" ] ### if the exit code $? is non-zero
            then
                # kill all remaining processes
                for((j=0;j<${#pids[@]};++j))
                do
                    if [[ $(ps -p ${pids[j]} -o pid=) ]]
                    then
                        echo killing child processes of ${pids[j]}
                        pkill -P ${pids[j]}
                    fi
                done
                cat _tmp${pids[i]}
                # print things to the terminal here
                echo "FAILED process ${pids[i]} args:   `cat _tmpargs${pids[i]}`"
                exit 1
            else
                echo "FINISHED process ${pids[i]} args: `cat _tmpargs${pids[i]}`"
            fi
        fi
    done
    # update list of running pids
    pids=(${tmp[@]})
}
# set this to monitor SIGCHLD
set -o monitor
# call handle_chld() when SIGCHLD signal is triggered
trap "handle_chld" CHLD

ALL_ARGS="2 32 87" ### ad nauseam
for A in $ALL_ARGS; do
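        (exec > "_tmp$BASHPID"; sleep $A; false) & ### $! would still hold the previous job's PID here; $BASHPID (bash 4+) is this subshell's own PID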
        pids+=($!)
        echo $A > _tmpargs${pids[${#pids[@]}-1]}
        echo "STARTED process ${pids[${#pids[@]}-1]} args: `cat _tmpargs${pids[${#pids[@]}-1]}`"
done
echo "Every process started.  Now waiting on PIDS:"
echo ${pids[@]}
wait ${pids[@]} ###wait until every process is finished (or exit in the trap)

The output of this version after 2+epsilon seconds is:

$ ./parallel_tests.sh 
STARTED process 66369 args: 2
STARTED process 66374 args: 32
STARTED process 66381 args: 87
Every process started.  Now waiting on PIDS:
66369 66374 66381
killing child processes of 66374
./parallel_tests.sh: line 43: 66376 Terminated: 15          sleep $A
killing child processes of 66381
./parallel_tests.sh: line 43: 66383 Terminated: 15          sleep $A
FAILED process 66369 args:  2

Pid 66369 fails first, and the other two processes are dealt with in the trap. I have simplified the construction of the test processes here, so we can't assume that I'll manually insert waits before spawning new ones. Additionally, some of the test processes can be nearly instant. In short, I have a whole mess of test processes, long and short, each starting as soon as resources can be allotted.

I'm not sure what's causing the problems I mentioned above, as this script uses several features that are new to me. General pointers are welcome!

(I have seen this question and it does not answer my question)

Solution

cat arguments | parallel --halt now,fail=1 my_prg

Alternatively:

parallel --halt now,fail=1 my_prg ::: $ALL_ARGS
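
Neither command above sets a job limit explicitly. As a rough sketch, `-j` caps the number of simultaneous jobs, and the command string itself can report the failing argument before `--halt now,fail=1` stops everything. The `(sleep {}; false)` stand-in for my_prg, the durations in ALL_ARGS and the FAILED message format are assumptions made for illustration, modelled on the MWE in the question:

```shell
#!/bin/bash
# Sketch only: "(sleep {}; false)" is a hypothetical stand-in for my_prg,
# mirroring the question's MWE (every job sleeps, then fails).
ALL_ARGS="0.2 3 5"

# -j 2               run at most 2 jobs at the same time
# --halt now,fail=1  when the first job fails, kill the running ones and exit
# {}                 replaced by the current argument
status=0
parallel -j 2 --halt now,fail=1 \
    '(sleep {}; false) || (echo "FAILED args: {}" >&2; exit 1)' \
    ::: $ALL_ARGS || status=$?

echo "parallel exited with status $status"
```

Here the job for 0.2 fails first, so the job for 3 should be killed and the job for 5 should never start. Without `-j`, GNU Parallel picks a limit based on the number of CPUs, which may already be a sensible cap on the laptop.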

GNU Parallel is designed so that it also kills remote jobs. It does this using process groups and heavy Perl scripting on the remote server: https://www.gnu.org/software/parallel/parallel_design.html#The-remote-system-wrapper
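
Where GNU Parallel cannot be installed, the local-machine part of this behaviour (a cap on concurrent jobs, a stop at the first failure, clean-up on Ctrl-C) can be approximated in plain bash. This is a different technique from the one above: it polls the child PIDs, much like the script in the question, instead of relying on a CHLD trap. All names here (MAX_JOBS, run_one, kill_all, reap, run_limited) are hypothetical, and run_one is only a stand-in for a real test:

```shell
#!/bin/bash
# Sketch only, local machine only. All function and variable names are hypothetical.
MAX_JOBS=2
pids=()   # PIDs of jobs not yet collected
args=()   # argument each of those jobs was started with

# Stand-in for one test: "bad" fails at once; anything else sleeps that long.
# exec makes the job's PID the program itself, so kill reaches it directly.
run_one() {
    [ "$1" = "bad" ] && return 1
    exec sleep "$1"
}

# Terminate every job that is still running, then reap them.
kill_all() {
    local running
    running=$(jobs -pr)
    [ -n "$running" ] && kill $running 2>/dev/null
    wait
}

# Collect finished jobs; return 1 as soon as one of them turns out to have failed.
reap() {
    local i alive=() alive_args=()
    for i in "${!pids[@]}"; do
        if kill -0 "${pids[i]}" 2>/dev/null; then
            alive+=("${pids[i]}"); alive_args+=("${args[i]}")
        elif ! wait "${pids[i]}"; then
            echo "FAILED process ${pids[i]} args: ${args[i]}" >&2
            return 1
        fi
    done
    pids=("${alive[@]}"); args=("${alive_args[@]}")
}

# Run run_one once per argument, with at most MAX_JOBS jobs at a time.
run_limited() {
    local arg
    pids=(); args=()
    for arg in "$@"; do
        while (( ${#pids[@]} >= MAX_JOBS )); do   # poll until a slot is free
            sleep 0.1
            reap || { kill_all; return 1; }
        done
        run_one "$arg" &
        pids+=($!); args+=("$arg")
    done
    while (( ${#pids[@]} > 0 )); do               # drain the remaining jobs
        sleep 0.1
        reap || { kill_all; return 1; }
    done
    return 0
}

# Clean up on keyboard interrupt or termination.
trap 'kill_all; exit 130' INT TERM

# Example: run_limited 3 bad 3 3 || exit 1
```

Note that kill_all only signals each job's own PID. A test that spawns children of its own would still need something like the pkill -P call from the question, and nothing here reaches remote machines.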