This problem is very strange and I cannot find any documentation about this online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collect/print their exit code at the end. I find that without catching SIGCHLD things work as I would expect however, things break when I catch the signal. Here is the code:
#!/bin/bash
#enabling job control
set -m
cmd_array=( "$@" ) #array of commands to run in parallel
cmd_count=$# #number of commands to run
cmd_idx=0; #current index of command
cmd_pids=() #array of child proc pids
trap 'echo "Child job existed"' SIGCHLD #setting up signal handler on SIGCHLD
#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
cmd=${cmd_array[$cmd_idx]} #retreiving the job command as a string
eval "$cmd" &
cmd_pids[$cmd_idx]=$! #keeping track of the job pid
echo "Job #$cmd_idx launched '$cmd']"
(( cmd_idx++ ))
done
#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
wait $pid
child_exit_code=$?
if [ $child_exit_code -ne 0 ]; then
echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
fi
(( idx++ ))
done
You can tell something is wrong when you try to run this the following command:
./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"
The interesting thing here is that you can tell as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with a return code 145. I can tell the sleep 20 is still running even after the script is done. I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?
(By the way if I add a while loop when I wait and keep on waiting while the return code is 145, I actually get the result I expect)
Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:
#!/bin/bash
set -m
trap "echo child_exit" SIGCHLD
function test() {
sleep $1
echo "'sleep $1' just returned now"
}
echo sleeping for 6 seconds in the background
test 6 &
pid=$!
echo sleeping for 2 second in the background
test 2 &
echo waiting on the 6 second sleep
wait $pid
echo "wait return code: $?"
If you run this you will get the following output:
linux:~$ sh test2.sh
sleeping for 6 seconds in the background
sleeping for 2 second in the background
waiting on the 6 second sleep
'sleep 2' just returned now
child_exit
wait return code: 145
lunux:~$ 'sleep 6' just returned now
Explanation:
As @muru pointed out "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status). Now what mislead me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.
Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."
So there you have it, what happens in the script above is the following:
sleep 6
starts in the backgroundsleep 3
starts in the backgroundwait
starts waiting on sleep 6
sleep 3
terminates and the SIGCHLD trap if fired interrupting wait, which returns 128 + SIGCHLD = 145sleep 6
terminates hence the "'sleep 6' just returned now" after the script already exited