Search code examples
bashshellsubprocesssignalswait

Bash: why wait returns prematurely with code 145


This problem is very strange and I cannot find any documentation about this online. In the following code snippet I am merely trying to run a bunch of sub-processes in parallel, printing something when they exit and collect/print their exit code at the end. I find that without catching SIGCHLD things work as I would expect however, things break when I catch the signal. Here is the code:

#!/bin/bash

#enabling job control
set -m

cmd_array=( "$@" )         #array of commands to run in parallel
cmd_count=$#               #number of commands to run
cmd_idx=0;                 #current index of command
cmd_pids=()                #array of child proc pids
trap 'echo "Child job existed"' SIGCHLD #setting up signal handler on SIGCHLD

#running jobs in parallel
while [ $cmd_idx -lt $cmd_count ]; do
  cmd=${cmd_array[$cmd_idx]} #retreiving the job command as a string
  eval "$cmd" &
  cmd_pids[$cmd_idx]=$!            #keeping track of the job pid
  echo "Job #$cmd_idx launched '$cmd']"
  (( cmd_idx++ ))
done

#all jobs have been launched, collecting exit codes
idx=0
for pid in "${cmd_pids[@]}"; do
  wait $pid
  child_exit_code=$?
  if [ $child_exit_code -ne 0 ]; then
    echo "ERROR: Job #$idx failed with return code $child_exit_code. [job_command: '${cmd_array[$idx]}']"
  fi
  (( idx++ ))
done

You can tell something is wrong when you try to run this the following command:

./parallel_script.sh "sleep 20; echo done_20" "sleep 3; echo done_3"

The interesting thing here is that you can tell as soon as the signal handler is called (when sleep 3 is done), the wait (which is waiting on sleep 20) is interrupted right away with a return code 145. I can tell the sleep 20 is still running even after the script is done. I can't find any documentation about such a return code from wait. Can anyone shed some light as to what is going on here?

(By the way if I add a while loop when I wait and keep on waiting while the return code is 145, I actually get the result I expect)


Solution

  • Thanks to @muru, I was able to reproduce the "problem" using much less code, which you can see below:

    #!/bin/bash
    
    set -m
    trap "echo child_exit" SIGCHLD
    
    function test() {
     sleep $1
     echo "'sleep $1' just returned now"
    }
    
    echo sleeping for 6 seconds in the background
    test 6 &
    pid=$!
    echo sleeping for 2 second in the background
    test 2 &
    echo waiting on the 6 second sleep
    wait $pid
    echo "wait return code: $?"
    

    If you run this you will get the following output:

    linux:~$ sh test2.sh
    sleeping for 6 seconds in the background
    sleeping for 2 second in the background
    waiting on the 6 second sleep
    'sleep 2' just returned now
    child_exit
    wait return code: 145
    lunux:~$ 'sleep 6' just returned now
    

    Explanation:

    As @muru pointed out "When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status." (c.f. Bash manual on Exit Status). Now what mislead me here is the "fatal" signal. I was looking for a command to fail somewhere when nothing did.

    Digging a little deeper in Bash manual on Signals: "When Bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed."

    So there you have it, what happens in the script above is the following:

    1. sleep 6 starts in the background
    2. sleep 3 starts in the background
    3. wait starts waiting on sleep 6
    4. sleep 3terminates and the SIGCHLD trap if fired interrupting wait, which returns 128 + SIGCHLD = 145
    5. my script exits since it does not wait anymore
    6. the background sleep 6 terminates hence the "'sleep 6' just returned now" after the script already exited