When a Heroku worker is restarted (either on command or as the result of a deploy), Heroku sends SIGTERM to the worker process. In the case of delayed_job, the SIGTERM signal is caught and the worker stops executing after the current job (if any) has finished.
If the worker takes too long to finish, Heroku will send SIGKILL. In the case of delayed_job, this leaves a locked job in the database that won't get picked up by another worker.
I'd like to ensure that jobs eventually finish (unless there's an error). Given that, what's the best way to approach this?
I see two options, but I'd like to get other input:
Get delayed_job to stop working on the current job (and release the lock) when it receives a SIGTERM.
Any thoughts?
TLDR:
Put this at the top of your job method:
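begin
  term_now = false
  # trap returns the previous handler (the one the worker installed)
  old_term_handler = trap 'TERM' do
    term_now = true
    # Chain to the worker's handler; skip if it is not callable (e.g. "DEFAULT")
    old_term_handler.call if old_term_handler.respond_to?(:call)
  end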
AND
Make sure this is called at least once every ten seconds:
  if term_now
    puts 'told to terminate'
    return true
  end
AND
At the end of your method, put this:
ensure
  trap 'TERM', old_term_handler
end
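Put together, the three pieces might look like the sketch below. The method name perform_batch, the items argument and the optional block are hypothetical placeholders for your own job; only the trap handling follows the snippets above. It returns the list of processed items rather than true, purely so the behaviour is easy to inspect.

```ruby
# Hypothetical job method: `items` and the per-item work are placeholders.
def perform_batch(items)
  processed = []
  term_now = false
  old_term_handler = nil
  begin
    # Install our handler and remember the one that was there before.
    old_term_handler = trap('TERM') do
      term_now = true
      # Chain to the previous handler if it is callable (it may be a
      # string such as "DEFAULT" when no Ruby handler was installed).
      old_term_handler.call if old_term_handler.respond_to?(:call)
    end

    items.each do |item|
      # Check the flag between units of work; each unit must be short
      # enough that this check happens well within the shutdown window.
      if term_now
        puts 'told to terminate'
        return processed
      end
      processed << item          # stand-in for one unit of real work
      yield item if block_given? # hook used only for illustration
    end
    processed
  ensure
    # Always put the previous handler back, even on early return or error.
    trap('TERM', old_term_handler) if old_term_handler
  end
end
```

Because the flag is only checked between items, a single item that runs longer than the shutdown window would still be killed; in that case the check needs to move inside the per-item work.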
Explanation:
I was having the same problem and came upon this Heroku article.
The job contained an outer loop, so I followed the article and added a trap('TERM') and exit. However, delayed_job picks that up as failed with SystemExit and marks the task as failed.
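The failure makes sense once you recall that Kernel#exit works by raising SystemExit, which any sufficiently broad rescue will intercept. A minimal sketch of the effect, where run_job is a hypothetical stand-in for a runner that rescues every exception (it is not delayed_job's actual code):

```ruby
# Hypothetical stand-in for a job runner that rescues every exception,
# including the SystemExit raised by Kernel#exit.
def run_job
  yield
  :succeeded
rescue Exception => e
  puts "job failed with #{e.class}"
  :failed
end

# exit does not terminate the process here; the runner sees an exception.
result = run_job { exit }
puts result
```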
With the SIGTERM now trapped by our trap, the worker's own handler isn't called, so the worker immediately restarts the job and then gets SIGKILL a few seconds later. Back to square one.
I tried a few alternatives to exit:
A return true marks the job as successful (and removes it from the queue), but suffers from the same problem if there's another job waiting in the queue.
Calling exit! will successfully exit the job and the worker, but it doesn't allow the worker to remove the job from the queue, so you still have the 'orphaned locked jobs' problem.
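The difference between exit and exit! can be seen outside delayed_job. A rough sketch, assuming a ruby executable can be spawned from the current process; the ensure block in each snippet stands in for the worker's cleanup, and run_snippet is a helper made up for this illustration:

```ruby
require 'rbconfig'
require 'open3'

# Hypothetical helper: run a one-line Ruby program in a child process
# and return whatever it wrote to standard output.
def run_snippet(code)
  out, _status = Open3.capture2(RbConfig.ruby, '-e', code)
  out
end

# exit raises SystemExit, so the ensure block still runs.
with_exit = run_snippet("begin; exit; ensure; puts 'cleanup ran'; end")

# exit! terminates immediately, skipping ensure blocks and at_exit hooks.
with_exit_bang = run_snippet("begin; exit!; ensure; puts 'cleanup ran'; end")

puts "exit:  #{with_exit.inspect}"
puts "exit!: #{with_exit_bang.inspect}"
```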
My final solution was the one given at the top of my answer. It comprises three parts:
Before we start the potentially long job, we add a new interrupt handler for 'TERM' by doing a trap (as described in the Heroku article), and we use it to set term_now = true.
But we must also grab the old_term_handler which the delayed_job worker code set (it is returned by trap) and remember to call it.
We must still ensure that we return control to Delayed::Worker with sufficient time for it to clean up and shut down, so we should check term_now at least once every ten seconds (preferably slightly more often) and return if it is true.
You can either return true or return false depending on whether you want the job to be considered successful or not.
Finally, it is vital to remember to remove your handler and reinstall the Delayed::Worker one when you have finished. If you fail to do this, you will keep a dangling reference to the handler we added, which can result in a memory leak if you add another one on top of that (for example, when the worker starts this job again).
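The first and last parts both rely on trap returning whatever handler was installed before. A small sketch of chaining and then restoring, where the calls array and the first handler are illustrative stand-ins for the worker's handler, not delayed_job code:

```ruby
calls = []

# Stand-in for the handler the worker installed earlier.
trap('TERM') { calls << :worker }

# Install ours; trap returns the previous handler (a Proc in this case).
previous = trap('TERM') do
  calls << :job
  previous.call if previous.respond_to?(:call)
end

# First signal: our handler runs, then chains to the worker's.
Process.kill('TERM', Process.pid)
sleep 0.1 # give the interpreter a moment to run the handler

# Put the worker's handler back, as the ensure block does.
trap('TERM', previous)

# Second signal: only the worker's handler runs now.
Process.kill('TERM', Process.pid)
sleep 0.1

p calls
```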