So I'm trying to process a CSV file via Sidekiq background job processing on a Heroku Worker instance. Whilst I can complete the process, I feel it could certainly be done quicker/more efficiently than I'm currently doing it. This question contains two parts - firstly are the database pools setup correctly and secondly how can I optimise the process.
worker_processes 3
timeout 15
preload_app true
before_fork do |server, worker|
Signal.trap 'TERM' do
puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
Process.kill 'QUIT', Process.pid
end
if defined?(ActiveRecord::Base)
ActiveRecord::Base.connection.disconnect!
end
end
after_fork do |server, worker|
Signal.trap 'TERM' do
puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
end
if defined?(ActiveRecord::Base)
config = ActiveRecord::Base.configurations[Rails.env] ||
Rails.application.config.database_configuration[Rails.env]
config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 # seconds
config['pool'] = ENV['DB_POOL'] || 2
ActiveRecord::Base.establish_connection(config)
end
end
---
:concurrency: 5
staging:
:concurrency: 5
production:
:concurrency: 35
:queues:
- [default, 1]
- [imports, 10]
- [validators, 10]
- [send, 5]
- [clean_up_tasks, 30]
- [contact_generator, 20]
ENV["REDISTOGO_URL"] ||= "redis://localhost:6379"
Sidekiq.configure_server do |config|
config.redis = { url: ENV["REDISTOGO_URL"] }
database_url = ENV['DATABASE_URL']
if database_url
ENV['DATABASE_URL'] = "#{database_url}?pool=50"
ActiveRecord::Base.establish_connection
end
end
Sidekiq.configure_client do |config|
config.redis = { url: ENV["REDISTOGO_URL"] }
end
The database connection pools are worked out as such:
I have 3 Web processes (unicorn worker_processes), to each of these I am allocating 2 ActiveRecord connections via the after_fork hook (config/unicorn.rb) for 6 total (maximum) of my 60 available Postgres connections assigned to the Web dyno. In the Sidekiq initialiser, I'm allocating 50 Postgres connections via the ?pool=50 param appended to ENV['DATABASE_URL'] as described (somewhere) in the docs. I'm keeping my Sidekiq concurrency value at 35 (sidekiq.yml) to ensure I stay under both the 50 Redis connections and 60 Postgres connection limits. This still needs more finely grained tuning, but I'd rather get the data processing itself sorted before going any further with this.
Now, assuming the above is correct (and it wouldn't surprise me at all if it weren't) I'm handling the following scenario:
A user uploads a CSV file to be processed via their browser. This file can be anywhere between 50 rows and 10 million rows. The file is uploaded to S3 via the CarrierWave gem.
The user then configures a couple of settings for the import via the UI,the culmination of which adds an FileImporter job to the Sidekiq queue to start creating various models based on the rows.
The Import worker looks something like:
class FileImporter
include Sidekiq::Worker
sidekiq_options :queue => :imports
def perform(import_id)
import = Import.find_by_id import_id
CSV.foreach(open(import.csv_data), headers: true) do |row|
# import.csv_data is the S3 URL of the file
# here I do some validation against a prebuilt redis table
# to validate the row without making any activerecord calls
# (business logic validations rather than straight DB ones anyway)
unless invalid_record # invalid_record being the product of the previous validations
# queue another job to actually create the AR models for this row
ImportValidator.perform_async(import_id, row)
# increment some redis counters
end
end
end
This is slow - I've tried to limit the calls to ActiveRecord in the FileImporter worker so I'm assuming it's because I'm streaming the file from S3. It's not processing rows fast enough to build up a queue so I'm never utilising all of my worker threads (usually somewhere between 15-20 of the 35 available threads are active. I've tried splitting this job up and feeding rows a 100 at a time into an intermediary worker which then creates the ImportValidator jobs in a more parallel fashion but that didn't fare much better.
So my question is, what's the best/most efficient method to accomplish a task like this?
It's possible you are at 100% CPU with 20 threads. You need another dyno.