Search code examples
ruby-on-railspostgresqlrake

How to migrate data in batches via rake task in postgres?


I have the following rake task:

namespace :backfill do
 desc "backfill device id data"
 task device_ids: :environment do
   user_count = User.where.not(device_id: nil).count

   puts "Begin device_id backfill for #{user_count} users"

   batch_size, offset = 2000, 0
   begin
     puts "Beginning backfill batch from rows #{offset} to #{offset + batch_size} out of #{user_count} rows"

    ActiveRecord::Base.connection.execute <<-SQL
      INSERT INTO user_device_infos (user_id, user_device_id, last_login_at, created_at, updated_at)
      SELECT users.id, users.device_id, users.last_sign_in_at, current_timestamp, current_timestamp
      FROM users
      LEFT JOIN user_device_infos
      ON users.id = user_device_infos.user_id
      AND users.device_id = user_device_infos.user_device_id
      WHERE users.device_id IS NOT NULL
      AND user_device_infos.user_id IS NULL
      ORDER BY users.created_at DESC
      LIMIT #{batch_size}
      OFFSET #{offset}
    SQL
    offset += batch_size
  end until offset > user_count

  puts "backfill complete"
end

end

This job is copying data from a from a users header table into a device info sub table. It is running on a production db, thus the mini-batches to prevent locking. It has a few problems.

  1. It mysteriously drops records, completing without copying all the required data. Not sure why.
  2. It is very slow. It is running on about 2 million records and takes over an hour, I need to speed it up.
  3. It needs to be idempotent.

How can I

  1. makes sure it is running on only records that have not already been migrated
  2. speed it up reasonably (increase batch size, or maybe use something besides limit/offset for mini-batches)
  3. verify that it has completed successfully

Solution

  • The problem was that I didn't need the offset in the sql clause as the left join already filtered out new records. Removing the offset both sped up and fixed the job.

    Saving the sql output to a variable and printing puts result.cmd_status helped verify that each batch was done successfully.