How to get supervisord to restart hung workers?


I have a number of Python workers managed by supervisord that should continuously print to stdout (after each completed task) if they are working properly. However, they tend to hang, and we've had difficulty finding the bug. Ideally supervisord would notice that they haven't printed in X minutes and restart them; the tasks are idempotent, so non-graceful restarts are fine. Is there any supervisord feature or addon that can do this? Or another supervisor-like program that has this out of the box?

We are already using http://superlance.readthedocs.io/en/latest/memmon.html to kill workers whose memory usage skyrockets, which mitigates some of the hangs, but a hang that doesn't involve a memory leak can still bring the workers to a standstill.


Solution

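  • One possible solution would be to wrap your Python script in a bash script that monitors its stdout and exits if nothing has been printed for a set period of time. Once the wrapper exits with a non-zero status, supervisord can restart it according to the program's autorestart setting.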

    For example:

    kill-if-hung.sh

    #!/usr/bin/env bash
    set -e
    
    TIMEOUT=60
    LAST_CHANGED="$(date +%s)"
    
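    # Background ticker: signal this script once a second so that the
    # USR1 trap can check how long stdout has been silent
    {
        set -e
        while true; do
            sleep 1
            kill -USR1 $$
        done
    } &
    
    trap check_output USR1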
    
    check_output() {
        CURRENT="$(date +%s)"
        if [[ $((CURRENT - LAST_CHANGED)) -ge $TIMEOUT ]]; then
            echo "Process STDOUT hasn't printed in $TIMEOUT seconds"
            echo "Considering process hung and exiting"
            exit 1
        fi
    }
    
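    # Create a named pipe to carry the wrapped command's stdout
    STDOUT_PIPE="$(mktemp -u)"
    mkfifo "$STDOUT_PIPE"
    
    trap cleanup EXIT
    cleanup() {
        # Remove the pipe first: the kill below also signals this script
        [[ -p "$STDOUT_PIPE" ]] && rm -f "$STDOUT_PIPE"
        kill -- -$$ # Send TERM to the whole process group, children included
    }
    
    # Run the wrapped command in the background, writing into the pipe
    "$@" >"$STDOUT_PIPE" || exit 2 &
    
    # Echo each line unmodified and record when it arrived
    while true; do
        if IFS= read -r tmp; then
            printf '%s\n' "$tmp"
            LAST_CHANGED="$(date +%s)"
        fi
    done <"$STDOUT_PIPE"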
    

    Then you would set the program's command in supervisord to something like: kill-if-hung.sh python -u some-script.py (-u disables output buffering; alternatively, set PYTHONUNBUFFERED).

    I'm sure you could imagine a python script that'd do something similar.
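
For what it's worth, here is a rough sketch of what such a Python wrapper might look like, assuming Python 3.7+. The file name kill_if_hung.py, the function run_with_watchdog, and its timeout and poll_interval parameters are made up for illustration. Note that it only kills the direct child process, not any grandchildren the worker may have spawned.

```python
#!/usr/bin/env python3
"""Run a command and give up on it if its stdout stays silent too long.

Hypothetical file name: kill_if_hung.py
"""
import subprocess
import sys
import threading
import time


def run_with_watchdog(cmd, timeout=60.0, poll_interval=1.0):
    """Run cmd, echo its stdout, and kill it after `timeout` silent seconds.

    Returns the child's exit code, or 1 if it was killed as hung.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    last_output = time.monotonic()

    def pump():
        # Forward each line from the child and note when it arrived.
        nonlocal last_output
        for line in proc.stdout:
            print(line, end="", flush=True)
            last_output = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    # Poll until the child exits on its own or has been silent too long.
    while proc.poll() is None:
        if time.monotonic() - last_output >= timeout:
            print(f"No stdout for {timeout} seconds; considering process hung",
                  file=sys.stderr)
            proc.kill()  # non-graceful; kills the direct child only
            proc.wait()
            return 1
        time.sleep(poll_interval)

    # Give the reader a moment to drain any remaining output.
    reader.join(timeout=poll_interval)
    return proc.returncode


# Only act as a command-line wrapper when a command was actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_watchdog(sys.argv[1:]))
```

You would then run it under supervisord the same way as the bash version, e.g. kill_if_hung.py python -u some-script.py.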