So we have an application that has celery workers. We start those workers using an upstart file /etc/init/fact-celery.conf
that looks like the following:
description "FaCT Celery Worker."
start on runlevel [2345]
stop on runlevel [06]
respawn
respawn limit 10 5
setuid fact
setgid fact
script
[ -r /etc/default/fact ] && . /etc/default/fact
if [ "$START_CELERY" != "yes" ]; then
echo "Service disabled in '/etc/default/fact'. Not starting."
exit 1
fi
ARGUMENTS=""
if [ "$BEAT_SERVICE" = "yes" ]; then
ARGUMENTS="--beat"
fi
/usr/bin/fact celery worker --loglevel=INFO --events --schedule=/var/lib/fact/celerybeat-schedule --queues=$CELERY_QUEUES $ARGUMENTS
end script
It calls a python wrapper script that looks like the following:
#!/bin/bash
WHOAMI=$(whoami)
PYTHONPATH=/usr/share/fact
PYTHON_BIN=/opt/fact-virtual-environment/bin/python
DJANGO_SETTINGS_MODULE=fact.settings.staging
if [ ${WHOAMI} != "fact" ];
then
sudo -u fact $0 $*;
else
# Python needs access to the CWD, but we need to deal with apparmor restrictions
pushd $PYTHONPATH &> /dev/null
PYTHONPATH=${PYTHONPATH} DJANGO_SETTINGS_MODULE=${DJANGO_SETTINGS_MODULE} ${PYTHON_BIN} -m fact.managecommand $*;
popd &> /dev/null
fi
The trouble with this setup is that when we stop the service, we get left over pact-celery workers that don't die. For some reason upstart can't track the forked processes. I've read in some similar posts that upstart can't track more than two forks.
I've tried using expect fork
but then upstart just hangs whenever I try to start or stop the service.
Other posts I've found on this say to call the python process directly instead of using the wrapper script, but we've already built apparmor profiles around these scripts and there are other things in our workflow that are pretty dependent on them.
Is there any way, with the current wrapper scripts, to handle killing off all the celery workers on a service stop?
There is some discussion about this in the Workers Guide, but basically the usual process is to send a TERM
signal to the worker, which will cause it to wait for all the currently running tasks to finish before exiting clean.
Alternatively, you can send the KILL
signal if you want it to stop immediately with potential data loss, but then as you said celery isn't able to intercept the signal and cleanup the children in that case. The only recourse that is mentioned is to manually clean up the children like this:
$ ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill -9