Tags: ruby-on-rails, postgresql, query-optimization, rails-activerecord

How should you backfill a new table in Rails?


I'm creating a new table that needs to be backfilled with data based on existing User accounts (over a couple dozen thousand of them), using the following one-time rake task.

What I've decided to do is create a big INSERT string for every 2000 users and execute that query.

Here's what the code roughly looks like:

task :backfill_my_new_table => :environment do
    inserts = []
    User.find_each do |user|
        tuple = # form the tuple based on user and user associations like (1, 'foo', 'bar', NULL)
        inserts << tuple
    end

    # At this point, the inserts array is of size at least 20,000
    conn = ActiveRecord::Base.connection
    inserts.each_slice(2000) do |slice|
        # join only the current slice, not the whole inserts array
        sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
        conn.execute(sql)
    end
end

So I'm wondering, is there a better way to do this? What are some drawbacks of the approach I took? How should I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with over a couple dozen thousand VALUES tuples? What are the drawbacks of that method?

Thanks!


Solution

  • It depends on which PG version you are using, but in most cases of bulk-loading data into a table this checklist is enough:

    • try to use COPY instead of INSERT whenever possible (see the sketch after this list);
    • if using multiple INSERTs, disable autocommit and wrap all the INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
    • disable indexes and checks/constraints on the target table;
    • disable table triggers;
    • alter the table so it becomes unlogged (possible since PG 9.5; don't forget to turn logging back on after the data import), or increase max_wal_size so the WAL won't be flooded
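
    For the COPY route from the first bullet, here is a minimal sketch of what it could look like from a rake task, going through ActiveRecord's raw PG connection and the pg gem's copy_data / PG::TextEncoder::CopyRow API (the table and column names are taken from the question; the row values are placeholders):

    task :backfill_my_new_table => :environment do
      pg = ActiveRecord::Base.connection.raw_connection   # the underlying PG::Connection
      encoder = PG::TextEncoder::CopyRow.new               # handles escaping; nil becomes NULL

      pg.copy_data("COPY my_new_table (ref_id, column_a, column_b, column_c) FROM STDIN", encoder) do
        User.find_each do |user|
          # placeholder values -- build the real row from user and its associations
          pg.put_copy_data([user.id, 'foo', 'bar', nil])
        end
      end
    end

    The trigger and logging bullets are plain SQL you can send through the same connection before the load (ALTER TABLE my_new_table DISABLE TRIGGER ALL, ALTER TABLE my_new_table SET UNLOGGED) and undo afterwards (ENABLE TRIGGER ALL, SET LOGGED).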

    20k rows is not a big deal for PG, so 2k-row sliced inserts within one transaction will be just fine, unless there are some very complex triggers/checks involved. It is also worth reading the PG manual section on bulk loading.
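
    Applied to the task from the question, the single-transaction variant could look roughly like this (a sketch; the 'foo' / 'bar' values are placeholders for whatever the real tuple-building logic produces, and the tuple values are assumed to be already-escaped SQL literals):

    task :backfill_my_new_table => :environment do
      conn = ActiveRecord::Base.connection

      inserts = []
      User.find_each do |user|
        # placeholder -- form the real tuple from user and its associations
        inserts << "(#{user.id}, 'foo', 'bar', NULL)"
      end

      ActiveRecord::Base.transaction do   # one BEGIN ... COMMIT around all the slices
        inserts.each_slice(2000) do |slice|
          conn.execute("INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}")
        end
      end
    end

    This keeps the 2k-row batching but turns the whole backfill into a single commit, which is what the second item in the checklist above is after.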

    UPD: also, a little old yet wonderful piece from depesz; an excerpt:

    so, if you want to insert data as fast as possible – use copy (or better yet – pgbulkload). if for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). then if you can, bundle them in transactions, and use prepared transactions, but generally – they don't give you much.