I have a dataset with 300 million entries that I need to insert into a PostgreSQL database. The data is stored as gzipped JSON lines files on Amazon S3. Each file has 50k JSON lines, and each JSON line is about 35 KB.
Right now I'm creating one Sidekiq job for each of the ~5,500 JSON lines files on S3. I have workers on Heroku (standard-2x dynos) processing these jobs. Each worker downloads its file, parses the lines, and persists them via ActiveRecord in transactions of 1,000 rows at a time. I'm finding that each worker can persist around 2,500 rows per minute. I'm also finding that if I increase the number of workers significantly (e.g. to 50), the number of rows each worker inserts per minute goes down (my database should be able to handle up to 200 connections).
I was hoping I could make this faster. Any pointers for improving performance?
This is the logic in the Sidekiq job:
# entries is an array of 50k strings, where each string is a JSON object
entries = EntriesDataService.get_entries(s3_url)

entries.each_slice(1000) do |chunk|
  ActiveRecord::Base.transaction do
    chunk.each { |p| Model.from_json_string(p) }
  end
end
You can perform a multi-row insert for each chunk like this:
entries.each_slice(1000) do |chunk|
  values = get_values_from_chunk(chunk)
  query = "INSERT INTO table (col1, col2, ...) VALUES #{values}"
  ActiveRecord::Base.connection.execute(query)
end
The get_values_from_chunk method has to return all the values for the chunk as a single string, for example:

values = "(col1v1, col2v1, ...), (col1v2, col2v2, ...), ..., (col1vn, col2vn, ...)"
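As a minimal sketch, get_values_from_chunk could look something like this, assuming each JSON line can be parsed directly and that col1/col2 are placeholders for your actual columns and JSON keys:

require "json"

# Illustrative sketch: builds the VALUES portion of the multi-row INSERT.
# Column names and JSON keys are assumptions; adapt them to your schema.
def get_values_from_chunk(chunk)
  connection = ActiveRecord::Base.connection
  chunk.map do |json_string|
    attrs = JSON.parse(json_string)
    # quote each value so strings are escaped correctly for PostgreSQL
    "(#{connection.quote(attrs['col1'])}, #{connection.quote(attrs['col2'])})"
  end.join(", ")
end

Quoting each value through the connection keeps the interpolated SQL safe and correctly escaped.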
Batching the rows into a single INSERT statement per chunk this way should improve insertion speed considerably.