
delayed_job stops running after some time in production


In production, our delayed_job process is dying for some reason. I'm not sure if it's crashing or being killed by the operating system or what. I don't see any errors in the delayed_job.log file.

What can I do to troubleshoot this? I was thinking of installing monit to monitor it, but that will only tell me precisely when it dies. It won't really tell me why it died.

Is there a way to make it more chatty to the log file, so I can tell why it might be dying?

Any other suggestions?


Solution

  • I've come across two causes of delayed_job failing silently. The first is actual segfaults when people were using libxml in forked processes (this popped up on the mailing list some time back).

    The second is a problem in version 1.1.0 of the daemons gem, which delayed_job relies on (https://github.com/collectiveidea/delayed_job/issues#issue/81). It can easily be worked around by using 1.0.10, which is what my own Gemfile has in it.
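
    For reference, the daemons pin mentioned above would look roughly like this in a Gemfile. This is a sketch based only on the version number given here; check it against the delayed_job version you are actually running.

```ruby
# Gemfile
# Pin daemons to the release that predates the 1.1.0 problem
gem 'daemons', '1.0.10'
```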

    Logging

    delayed_job already logs errors, so if the worker dies without printing one, it's usually because no Ruby exception was raised (e.g. a segfault) or because something external is killing the process.
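
    If you want the worker to be a bit more chatty on its way out, one option is an at_exit hook that records what ended the process. The following is a minimal sketch, not part of delayed_job: describe_exit and install_exit_logger are hypothetical helper names, and where you call them from (for example an initializer) is up to you.

```ruby
require 'logger'

# Hypothetical helper: turn the exception that is ending the process
# (or nil for a normal exit) into a short description.
def describe_exit(error)
  case error
  when nil             then "exited normally"
  when SystemExit      then "exit called with status #{error.status}"
  when SignalException then "terminated by signal #{error.message}"
  else "unhandled #{error.class}: #{error.message}"
  end
end

# Hypothetical helper: register the hook. When an at_exit block runs,
# $! holds the exception in flight, or nil if there is none.
def install_exit_logger(logger)
  at_exit do
    logger.warn("worker #{Process.pid} #{describe_exit($!)}")
  end
end
```

    You would call it once at boot, e.g. install_exit_logger(Logger.new("log/delayed_job_exit.log")), where the log path is just an example. The hook will not run on SIGKILL (for instance from the kernel's OOM killer) and should not be expected to run after a segfault, so a worker that vanishes without leaving a line in this log points towards one of those causes. Signal handling inside the worker may also make a TERM show up as a normal exit.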

    Monitoring

    I use bluepill to monitor my delayed_job instances, and so far it has been very successful at ensuring that the jobs remain running. The steps to get bluepill running for an application are quite easy:

    Add the bluepill gem to your Gemfile:

    # Monitoring
    gem 'i18n' # Not sure why but it complained I didn't have it
    gem 'bluepill'
    

    I created a bluepill config file:

    app_home = "/home/mi/production"
    workers = 5
    Bluepill.application("mi_delayed_job", :log_file => "#{app_home}/shared/log/bluepill.log") do |app|
      (0...workers).each do |i|
        app.process("delayed_job.#{i}") do |process|
          process.working_dir = "#{app_home}/current"
    
          process.start_grace_time    = 10.seconds
          process.stop_grace_time     = 10.seconds
          process.restart_grace_time  = 10.seconds
    
          process.start_command = "cd #{app_home}/current && RAILS_ENV=production ruby script/delayed_job start -i #{i}"
          process.stop_command  = "cd #{app_home}/current && RAILS_ENV=production ruby script/delayed_job stop -i #{i}"
    
          process.pid_file = "#{app_home}/shared/pids/delayed_job.#{i}.pid"
          process.uid = "mi"
          process.gid = "mi"
        end
      end
    end
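
    While troubleshooting, it can also help to check by hand whether the pid recorded in one of those pid files still belongs to a running process. The sketch below is a rough illustration of that kind of check, not how bluepill itself is implemented; worker_alive? is a hypothetical helper name.

```ruby
# Hypothetical helper: true if the pid recorded in pid_file belongs
# to a process that currently exists.
def worker_alive?(pid_file)
  pid = Integer(File.read(pid_file).strip)
  Process.kill(0, pid) # signal 0 sends nothing, it only checks existence
  true
rescue Errno::ENOENT, Errno::ESRCH, ArgumentError
  false # no pid file, no such process, or unparsable file contents
rescue Errno::EPERM
  true  # the process exists but is owned by another user
end
```

    With the config above you could run, for example, worker_alive?("/home/mi/production/shared/pids/delayed_job.0.pid") from a console on the server. A pid file that is still present while the process is gone suggests the worker did not shut down cleanly.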
    

    Then in my capistrano deploy file I just added:

    # Bluepill related tasks
    after "deploy:update", "bluepill:quit", "bluepill:start"
    namespace :bluepill do
      desc "Stop processes that bluepill is monitoring and quit bluepill"
      task :quit, :roles => [:app] do
        run "cd #{current_path} && bundle exec bluepill --no-privileged stop"
        run "cd #{current_path} && bundle exec bluepill --no-privileged quit"
      end
    
      desc "Load bluepill configuration and start it"
      task :start, :roles => [:app] do
        run "cd #{current_path} && bundle exec bluepill --no-privileged load /home/mi/production/current/config/delayed_job.bluepill"
      end
    
      desc "Print the status of the processes bluepill is monitoring"
      task :status, :roles => [:app] do
        run "cd #{current_path} && bundle exec bluepill --no-privileged status"
      end
    end
    

    Hope this helps a little.