ruby-on-rails, postgresql, heroku, redis, sidekiq

Optimising Sidekiq, Redis, Heroku and Rails


So I'm trying to process a CSV file via Sidekiq background job processing on a Heroku Worker instance. Whilst I can complete the process, I feel it could certainly be done quicker/more efficiently than I'm currently doing it. This question contains two parts: firstly, are the database pools set up correctly, and secondly, how can I optimise the process?

Application environment:

  • Rails 4 application
  • Unicorn
  • Sidekiq
  • Redis-to-go (Mini plan, 50 connections max)
  • CarrierWave S3 implementation
  • Heroku Postgres (Standard Yanari, 60 connections max)
  • 1 Heroku Web dyno
  • 1 Heroku Worker dyno
  • NewRelic monitoring

config/unicorn.rb

worker_processes 3
timeout 15
preload_app true

before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end

  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end

  if defined?(ActiveRecord::Base)
    config = ActiveRecord::Base.configurations[Rails.env] ||
             Rails.application.config.database_configuration[Rails.env]
    config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 # seconds
    config['pool']              = ENV['DB_POOL'] || 2
    ActiveRecord::Base.establish_connection(config)
  end
end

config/sidekiq.yml

---
:concurrency: 5
staging:
  :concurrency: 5
production:
  :concurrency: 35
:queues:
  - [default, 1]
  - [imports, 10]
  - [validators, 10]
  - [send, 5]
  - [clean_up_tasks, 30]
  - [contact_generator, 20]

config/initializers/sidekiq.rb

ENV["REDISTOGO_URL"] ||= "redis://localhost:6379"

Sidekiq.configure_server do |config|
  config.redis = { url: ENV["REDISTOGO_URL"] }

  database_url = ENV['DATABASE_URL']
  if database_url
    ENV['DATABASE_URL'] = "#{database_url}?pool=50"
    ActiveRecord::Base.establish_connection
  end

end


Sidekiq.configure_client do |config|
  config.redis = { url: ENV["REDISTOGO_URL"] }
end

The database connection pools are worked out as such:

I have 3 web processes (unicorn worker_processes). To each of these I allocate 2 ActiveRecord connections via the after_fork hook (config/unicorn.rb), for a maximum of 6 of my 60 available Postgres connections assigned to the web dyno. In the Sidekiq initialiser, I allocate 50 Postgres connections via the ?pool=50 param appended to ENV['DATABASE_URL'], as described (somewhere) in the docs. I'm keeping my Sidekiq concurrency value at 35 (sidekiq.yml) to ensure I stay under both the 50 Redis connection and 60 Postgres connection limits. This still needs more fine-grained tuning, but I'd rather get the data processing itself sorted before going any further with this.
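As a rough sanity check on that arithmetic, here is the budget spelled out using only the figures from the configs above (the note about Sidekiq holding a few extra Redis connections beyond its worker threads is an assumption on my part):

web_processes       = 3    # unicorn worker_processes
db_pool_per_process = 2    # DB_POOL in the after_fork hook
sidekiq_concurrency = 35   # production value in sidekiq.yml

pg_limit    = 60           # Heroku Postgres Standard Yanari
redis_limit = 50           # Redis To Go Mini plan

web_pg_connections = web_processes * db_pool_per_process  # => 6
web_pg_connections + sidekiq_concurrency                  # => 41, under the 60 Postgres limit
sidekiq_concurrency                                        # => 35, under the 50 Redis limit,
                                                           #    though Sidekiq also holds a
                                                           #    handful of extra Redis
                                                           #    connections of its own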

Now, assuming the above is correct (and it wouldn't surprise me at all if it weren't), I'm handling the following scenario:

A user uploads a CSV file to be processed via their browser. This file can be anywhere between 50 rows and 10 million rows. The file is uploaded to S3 via the CarrierWave gem.

The user then configures a couple of settings for the import via the UI, the culmination of which adds a FileImporter job to the Sidekiq queue to start creating various models based on the rows.

The Import worker looks something like:

require 'open-uri' # needed for open() on the S3 URL below

class FileImporter
  include Sidekiq::Worker
  sidekiq_options :queue => :imports

  def perform(import_id)
    import = Import.find_by_id import_id

    CSV.foreach(open(import.csv_data), headers: true) do |row| 
      # import.csv_data is the S3 URL of the file

      # here I do some validation against a prebuilt redis table  
      # to validate the row without making any activerecord calls
      # (business logic validations rather than straight DB ones anyway)        

      unless invalid_record # invalid_record being the product of the previous validations

        # queue another job to actually create the AR models for this row
        ImportValidator.perform_async(import_id, row)

        # increment some redis counters
      end
    end
  end
end

This is slow. I've tried to limit the calls to ActiveRecord in the FileImporter worker, so I'm assuming the bottleneck is streaming the file from S3. It's not processing rows fast enough to build up a queue, so I'm never utilising all of my worker threads (usually somewhere between 15 and 20 of the 35 available threads are active). I've also tried splitting this job up and feeding rows 100 at a time into an intermediary worker, which then creates the ImportValidator jobs in a more parallel fashion, but that didn't fare much better (roughly along the lines of the sketch below).
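For reference, that batching attempt looked roughly like this. It's a simplified sketch: BatchedFileImporter and RowBatchProcessor are illustrative names rather than the exact code, and the to_hash conversion is there because Sidekiq job arguments need to be JSON-serialisable.

require 'csv'
require 'open-uri'

class BatchedFileImporter
  include Sidekiq::Worker
  sidekiq_options :queue => :imports

  def perform(import_id)
    import = Import.find_by_id import_id

    # Without a block, CSV.foreach returns an Enumerator, so the file is
    # still streamed rather than read into memory in one go.
    CSV.foreach(open(import.csv_data), headers: true).each_slice(100) do |rows|
      RowBatchProcessor.perform_async(import_id, rows.map(&:to_hash))
    end
  end
end

class RowBatchProcessor
  include Sidekiq::Worker
  sidekiq_options :queue => :imports

  def perform(import_id, rows)
    rows.each do |row|
      # same Redis-backed business validations as in FileImporter, then:
      ImportValidator.perform_async(import_id, row) # unless invalid_record
    end
  end
end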

So my question is, what's the best/most efficient method to accomplish a task like this?


Solution

  • It's possible you are at 100% CPU with 20 threads. You need another dyno.
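
If the worker dyno's CPU does turn out to be maxed out, New Relic (already in this stack) should show it, and adding capacity is a one-liner with the standard Heroku CLI (the dyno count of 2 here is just an example):

heroku ps:scale worker=2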