capistrano, delayed-job, god

delayed_job monitored by God - duplicate processes after restart


I'm monitoring delayed_job using God. This is my God config file.

  QUEUE = "slow"
  WORKERS = 14

  WORKERS.times do |num|
    God.watch do |w|
      w.name = "dj.#{num}"
      w.group = "tanda"

      w.uid = 'deployer'
      w.gid = 'deployer'

      w.start = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job start --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
      w.restart = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job restart --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
      w.stop = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job stop -i #{num}"

      w.start_grace = 30.seconds
      w.restart_grace = 30.seconds
      w.stop_grace = 30.seconds

      w.pid_file = "#{RAILS_ROOT}/tmp/pids/delayed_job.#{num}.pid"
      w.log = "#{RAILS_ROOT}/log/dj.#{num}.log"
      w.err_log = "#{RAILS_ROOT}/log/dj.#{num}.errors.log"

      w.behavior(:clean_pid_file)
      w.interval = 30.seconds
      w.dir = File.expand_path('.')

      w.env = {
        "RACK_ENV" => RAILS_ENV,
        "RAILS_ENV" => RAILS_ENV,
        "CURRENT_DIRECTORY" => RAILS_ROOT
      }

      w.start_if do |start|
        start.condition(:process_running) do |c|
          c.interval = 5.seconds
          c.running = false
        end
      end

      w.lifecycle do |on|
        on.condition(:flapping) do |c|
          c.to_state = [:start, :restart]
          c.times = 10
          c.within = 3.minutes
          c.transition = :unmonitored
          c.retry_in = 10.minutes
        end
      end
    end
  end

I'm then restarting these processes using Capistrano 2 on each deploy:

run("cd #{current_path} && rvmsudo god restart tanda")

When I start God, my ps output looks like this.

ps -e -www -o pid,rss,command | grep delayed
31960 220804 delayed_job.0
31966 220152 delayed_job.8
31973 226012 delayed_job.9
31979 215176 delayed_job.1
31984 210260 delayed_job.13
31994 240424 delayed_job.3
31997 225248 delayed_job.11
32003 196364 delayed_job.5
32009 236192 delayed_job.6
32015 214540 delayed_job.12
32022 247096 delayed_job.4
32029 206352 delayed_job.2
32047 232748 delayed_job.7
32061 228128 delayed_job.10

If I immediately do a Capistrano restart, without doing a deploy or anything else, then after a minute it looks like this.

ps -e -www -o pid,rss,command | grep delayed
 9884 198076 delayed_job.10
 9895 195372 delayed_job.0
 9919 196856 delayed_job.6
 9948 196772 delayed_job.5
 9964 196568 delayed_job.9
 9973 194092 delayed_job.12
 9982 195648 delayed_job.13
 9997 196392 delayed_job.2
10005 195356 delayed_job.4
10016 197268 delayed_job.3
10032 198820 delayed_job.8
10054 194316 delayed_job.7
10078 196780 delayed_job.11
10127 202420 delayed_job.1
10133 197468 delayed_job.1
10145 194040 delayed_job.1
10158 195760 delayed_job.1
10173 195844 delayed_job.1

And after another restart:

ps -e -www -o pid,rss,command | grep delayed
 9884 221780 delayed_job.10
 9973 225100 delayed_job.12
 9982 224708 delayed_job.13
10078 235076 delayed_job.11
21467 187056 delayed_job.0
21483 187844 delayed_job.7
21497 189648 delayed_job.10
21509 187316 delayed_job.2
21518 188180 delayed_job.11
21527 187968 delayed_job.3
21542 187852 delayed_job.12
21546 186900 delayed_job.13
21556 188628 delayed_job.5
21565 187816 delayed_job.9
21574 185216 delayed_job.4
21585 188088 delayed_job.1
21599 188556 delayed_job.1
21602 188400 delayed_job.1
21615 193484 delayed_job.1
21628 193288 delayed_job.8
21632 188228 delayed_job.1
21643 187804 delayed_job.6

As you can see, these duplicate processes sometimes have new pids (e.g. every process between the first and second dumps) but sometimes keep their old ones (e.g. DJ 10 between the second and third dumps).
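One quick way to spot the duplicates in output like this is to count processes per worker name. Below is a minimal Ruby sketch, assuming the three-column `pid rss command` format shown in the dumps above; `duplicate_workers` is a hypothetical helper name, not part of God or delayed_job.

```ruby
# Hypothetical helper: counts delayed_job processes per worker name in
# `ps -o pid,rss,command` output and returns only the names seen more than once.
def duplicate_workers(ps_output)
  counts = Hash.new(0)
  ps_output.each_line do |line|
    # Awk-style split: leading whitespace is ignored, at most three fields.
    pid, _rss, command = line.split(' ', 3)
    next unless pid.to_s.match?(/\A\d+\z/) && command

    name = command.strip
    counts[name] += 1 if name.start_with?('delayed_job.')
  end
  counts.select { |_name, count| count > 1 }
end

sample = <<~PS
  10078 196780 delayed_job.11
  10127 202420 delayed_job.1
  10133 197468 delayed_job.1
PS

p duplicate_workers(sample)
```

Feeding it the real output (for example from `ps -e -www -o pid,rss,command`) after each restart would show which worker names have more than one live process.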

I don't really know where to start debugging this. God doesn't report any errors when restarting, and the DJ logs show only the usual output from launching a process. The same thing doesn't happen on a smaller server that is otherwise identical but only meant to run 4 workers.

Has anyone seen this before?


Solution

  • I think this must be an issue in the daemons gem that delayed_job uses for running in the background, because adding this at the top of my God file seems to have fixed things:

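    ids = ('a'..'z').to_a
    WORKERS.times do |num|
      # Use a letter instead of a number as the worker identifier (delayed_job.a, delayed_job.b, ...)
      num = ids[num]
      # ... God.watch block as before ...
    end
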

    It seems like processes whose names share a prefix, such as delayed_job.1 and delayed_job.11, were clashing, which caused lots of problems. I haven't isolated the cause any further, but switching to a different naming convention (delayed_job.a in this case) has fixed things for me for now.

    I'll leave this open in case someone has a better solution or can explain why this worked.
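The suspected clash can be illustrated with a small Ruby sketch. It assumes, purely for illustration, that something in the stack matches worker names by prefix; this is not confirmed behaviour of the daemons gem, and `prefix_clashes` is a hypothetical helper. Under that assumption, delayed_job.1 is ambiguous with delayed_job.10 through delayed_job.13 when there are 14 workers, while 4 numeric workers or letter suffixes produce no ambiguity, which would match what was observed.

```ruby
# Hypothetical illustration only: NOT how the daemons gem is known to match
# processes. It shows which worker names are a prefix of another worker's name.
def prefix_clashes(names)
  names.each_with_object({}) do |name, clashes|
    others = names.select { |other| other != name && other.start_with?(name) }
    clashes[name] = others unless others.empty?
  end
end

numeric = (0...14).map { |i| "delayed_job.#{i}" }              # original scheme, 14 workers
small   = (0...4).map { |i| "delayed_job.#{i}" }               # the smaller server, 4 workers
letters = ('a'..'z').first(14).map { |id| "delayed_job.#{id}" } # the workaround

p prefix_clashes(numeric)
p prefix_clashes(small)
p prefix_clashes(letters)
```

Only the 14-worker numeric scheme reports a clash, and only for delayed_job.1. Note that the letter scheme as written tops out at 26 workers.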