Starting and stopping celery processes in upstart with a python wrapper script

We have an application that uses Celery workers. We start those workers with an upstart job file, /etc/init/fact-celery.conf, that looks like the following:

description "FaCT Celery Worker."
start on runlevel [2345]
stop on runlevel [06]

respawn
respawn limit 10 5

setuid fact
setgid fact

script
  [ -r /etc/default/fact ] && . /etc/default/fact

  if [ "$START_CELERY" != "yes" ]; then
    echo "Service disabled in '/etc/default/fact'. Not starting."
    exit 1
  fi

  ARGUMENTS=""

  if [ "$BEAT_SERVICE" = "yes" ]; then
    ARGUMENTS="--beat"
  fi

  /usr/bin/fact celery worker --loglevel=INFO --events --schedule=/var/lib/fact/celerybeat-schedule --queues=$CELERY_QUEUES $ARGUMENTS
end script

It calls a wrapper script around Python that looks like the following:

#!/bin/bash

WHOAMI=$(whoami)
PYTHONPATH=/usr/share/fact
PYTHON_BIN=/opt/fact-virtual-environment/bin/python
DJANGO_SETTINGS_MODULE=fact.settings.staging

if [ "${WHOAMI}" != "fact" ];
then
  # Re-run this script as the fact user, passing each argument through unchanged
  sudo -u fact "$0" "$@";
else
  # Python needs access to the CWD, but we need to deal with apparmor restrictions
  pushd "$PYTHONPATH" &> /dev/null
  PYTHONPATH=${PYTHONPATH} DJANGO_SETTINGS_MODULE=${DJANGO_SETTINGS_MODULE} ${PYTHON_BIN} -m fact.managecommand "$@";
  popd &> /dev/null
fi

The trouble with this setup is that when we stop the service, we are left with fact-celery workers that don't die. For some reason upstart can't track the forked processes. I've read in some similar posts that upstart can't track more than two forks.

I've tried using expect fork but then upstart just hangs whenever I try to start or stop the service.

Other posts I've found on this say to call the python process directly instead of using the wrapper script, but we've already built apparmor profiles around these scripts and there are other things in our workflow that are pretty dependent on them.

Is there any way, with the current wrapper scripts, to handle killing off all the celery workers on a service stop?

Solution

There is some discussion of this in the Celery Workers Guide, but basically the usual process is to send a TERM signal to the worker, which causes it to wait for all currently running tasks to finish before exiting cleanly.
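
As a minimal sketch of that graceful shutdown, the helper below sends TERM to a single PID and then polls until the process exits or a timeout elapses. The function name stop_gracefully, the 60-second default, and the one-second polling interval are assumptions for illustration; none of them come from Celery or upstart.

```shell
#!/bin/bash
# Hypothetical helper: ask a process to shut down with TERM, then wait for it.
# Usage: stop_gracefully PID [TIMEOUT_SECONDS]
# Returns 0 once the process is gone, 1 if it is still alive after the timeout.
stop_gracefully() {
  local pid="$1"
  local timeout="${2:-60}"   # assumed default; size it to your longest task
  local waited=0

  # TERM asks the worker to finish its running tasks and then exit.
  kill -TERM "$pid" 2>/dev/null || return 0   # nothing to stop

  # kill -0 sends no signal; it only checks that the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # still running; the caller decides whether to escalate
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example: give the main worker process up to five minutes to drain.
#   stop_gracefully "$WORKER_PID" 300
```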

Alternatively, you can send the KILL signal if you want the worker to stop immediately, at the risk of losing data. In that case, as you said, celery can't intercept the signal, so it has no chance to clean up its children. The only recourse mentioned is to clean up the children manually, like this:

$ ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill -9
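
A slightly narrower variant of that cleanup, as a sketch: the helper below filters ps output down to the PIDs of "celery worker" processes owned by one user, so the pipeline does not pick up its own filter process or workers belonging to other users. The function name celery_worker_pids is hypothetical, and the fact user is assumed from the upstart job in the question.

```shell
#!/bin/bash
# Hypothetical helper: read `ps auxww` output on stdin and print the PID
# (column 2) of every "celery worker" process owned by the given user.
# Usage: ps auxww | celery_worker_pids USER
celery_worker_pids() {
  local user="$1"
  # The [c] bracket keeps this awk command line from matching its own pattern.
  awk -v user="$user" '$1 == user && /[c]elery worker/ { print $2 }'
}

# Example (xargs -r is a GNU extension that skips kill when no PIDs match):
#   ps auxww | celery_worker_pids fact | xargs -r kill -9
```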