
How do I track and kill all processes spawned by running a script without knowing the names of the subprocesses?


I have a Bash wrapper script that launches a complex modeling script, which in turn launches multiple subprocesses and scripts of its own. I want to figure out how to track all processes that are spawned by one run of the modeling script so as to kill them all when certain criteria are met.

For example, my wrapper script called pipeline_runner.sh does the following:

#!/bin/bash

# Some set up of the script ...

./monitor_job.sh ... arguments TBD ... &

script_path="path/to/bash/script"

chmod u+x "$script_path"
"$script_path"

# ...

Each run of pipeline_runner.sh will start an instance of monitor_job.sh in the background to monitor the specific run of path/to/bash/script launched by that run of pipeline_runner.sh. When some arbitrary condition defined in monitor_job.sh is met, it should be able to kill that specific run of path/to/bash/script, together with all processes directly or indirectly launched by it.

The multiple other processes launched by a run of path/to/bash/script are many and variable, so I am trying to figure out how to capture every script that is spawned from running this into some sort of group or list and be able to kill them all when needed. Killing just the initial $script_path process is insufficient because all that script's subprocesses will survive.

Important secondary goals are:

  • to make this dynamic so that it doesn't depend on which script is designated by $script_path. This means I can't just hard code specific command names to look for.

  • to perform the monitoring in a separate script (monitor_job.sh) as described, not directly in pipeline_runner.sh.

How can I track all the processes launched by the modeling script so as to be able to kill them all at need?


Solution

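  • kill -- -$$ sends SIGTERM to every process in the process group whose ID is $$. This terminates the whole group, provided the script is the group leader (as it is when started as a foreground job from an interactive shell).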

    For example, the following script spawns two subprocesses, sleep 15 and sleep 30, then runs its other tasks (here just sleep 5). Once the exit criterion is met, it kills the whole process group.

    #!/bin/sh
    echo "Parent pid $$"
    sleep 15 &
    echo "child 1 pid $!"
    
    sleep 30 &
    echo "child 2 pid $!"
    
    sleep 5
    echo "criteria met"
    kill -- -$$
    

    If we run this with bash test.sh ; ps -ef | grep sleep, we get:

    $ bash test.sh ;  ps -ef | grep sleep
    Parent pid 87546
    child 1 pid 87547
    child 2 pid 87548
    criteria met
    Terminated: 15
      501 87595   789   0  2:08PM ttys007    0:00.00 grep sleep
    

    We observe therefore that the subprocesses have been killed as well.
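One caveat is worth checking before relying on this: kill -- -$$ only finds a target when $$ is also a process-group ID, that is, when the script leads its own group. A script launched from another script normally inherits the caller's group instead. The sketch below probes this with kill -0, which sends no signal and only reports whether the target exists. It assumes bash and the util-linux setsid utility are available; the variable names check, plain and own_group are illustrative only.

```shell
#!/bin/bash
# Probe: does a process group with ID $$ exist? kill -0 delivers no signal,
# it only reports whether the target exists.
check='if kill -0 -- "-$$" 2>/dev/null; then echo leader; else echo member; fi'

# Launched from a script, the child inherits this script's process group,
# so its own PID is not a process-group ID.
plain=$(bash -c "$check")

# setsid starts the child in a new session, where it leads its own group.
own_group=$(setsid bash -c "$check")

echo "plain child:  $plain"
echo "setsid child: $own_group"
```

If the plain child reports "member", a kill -- -$$ inside it would fail with "No such process", and the script would need to be started in its own group (for instance via setsid) first.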

    The problem with this approach is that if Ctrl+C is pressed right after the script starts, we get:

    $ bash test.sh ;  ps -ef | grep sleep
    Parent pid 88352
    child 1 pid 88353
    child 2 pid 88354
    ^C
      501 88353     1   0  2:10PM ttys007    0:00.00 sleep 15
      501 88354     1   0  2:10PM ttys007    0:00.00 sleep 30
      501 88391   789   0  2:10PM ttys007    0:00.00 grep sleep
    

    This means that the various subprocesses would continue to run as orphans (adopted by the init process).

    To address this issue we could use trap, and change our script to:

    #!/bin/sh
    
    # Unprefixed signal names (INT, TERM) are the portable spelling under /bin/sh
    trap 'trap - TERM && kill -- -$$' INT TERM EXIT
    
    echo "Parent pid $$"
    
    sleep 15 &
    echo "child 1 pid $!"
    
    sleep 30 &
    echo "child 2 pid $!"
    
    sleep 5
    echo "criteria met"
    exit 0
    

    Normal execution remains the same: exit 0 triggers the EXIT trap, which runs kill -- -$$ as before.

    Now if we run our script and press Ctrl+C right after it starts, this time we get:

    Parent pid 91578
    child 1 pid 91579
    child 2 pid 91580
    ^CTerminated: 15
      501 91590   789   0  2:18PM ttys007    0:00.00 grep sleep
    

    where we see that the subprocesses have been killed as well.
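To apply the same idea to the wrapper/monitor setup from the question, one option is to start the modeling script in its own process group and hand that group ID to the monitor. The sketch below is only an illustration under several assumptions: it uses the util-linux setsid utility, runs from a non-interactive script (so that $! is the PID of the new group leader), and replaces both "$script_path" and monitor_job.sh with small stand-ins. The names job, marker, job_pgid and monitor are hypothetical.

```shell
#!/bin/bash
workdir=$(mktemp -d)
marker="$workdir/child_survived"

# Stand-in for "$script_path": spawns two children that would each
# create the marker file if they were still alive after 2 seconds.
job='(sleep 2; touch "$1") & (sleep 2; touch "$1") & wait'

# setsid makes the job the leader of a new session and process group,
# so its PID doubles as the group ID for everything it spawns.
setsid bash -c "$job" job "$marker" &
job_pgid=$!

# Stand-in for monitor_job.sh: it only needs the group ID as an argument.
monitor() {
    sleep 1               # placeholder for the real condition check
    kill -- "-$1"         # negative ID targets the whole process group
}
monitor "$job_pgid"

wait "$job_pgid" 2>/dev/null   # reap the job leader
sleep 2                        # outlive the children's timers

if [ -e "$marker" ]; then result="survivors"; else result="all killed"; fi
echo "$result"
rm -rf "$workdir"
```

Because the monitor needs nothing but the group ID, the same kill line could live in a separate monitor_job.sh that receives the ID as an argument, and it does not depend on which script "$script_path" points to.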