Tags: ruby-on-rails, activerecord

Active Record in_batches vs find_in_batches


I want to perform batch operations on millions of records from the database.

According to the ActiveRecord documentation, there are two methods for batch operations: #find_in_batches and #in_batches. However, I can't find any difference between them, except that one returns an Enumerator and the other an ActiveRecord::Relation.

Assuming they differ in performance, which one performs better in which scenario? And is there a better way to conditionally update millions of rows, other than raw SQL?


Solution

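  • In short, find_in_batches yields each batch as an array of already-loaded records, while in_batches yields each batch as an ActiveRecord::Relation. Because a relation is yielded, you can call relation methods such as update_all on the batch without loading its records first.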

    So the following code:

    Post.find_in_batches do |group|
       group.each { |post| puts post.title }
    end
    

    will send only one query per batch to the database, retrieving all of the posts' data for that batch:

    SELECT "posts".* FROM "posts" WHERE ...
    

    However:

    Post.in_batches do |group|
       group.each { |post| puts post.title }
    end
    

    will send two queries per batch to the database. The first query gets the posts' ids for the batch:

    SELECT "posts"."id" FROM "posts" WHERE ...
    

    The second query, triggered when the block iterates over the relation, gets all of the posts' data for the batch:

    SELECT "posts".* FROM "posts" WHERE ...
    

    More details:

    If you look at the Rails source code for these two methods, you will see that find_in_batches actually calls in_batches with load: true. In in_batches itself, however, load defaults to false.

    The part of in_batches that uses the value of load looks like this:

    if load
      records = batch_relation.records
      ids = records.map(&:id)
      yielded_relation = where(primary_key => ids)
      yielded_relation.load_records(records)
    else
      ids = batch_relation.pluck(primary_key)
      yielded_relation = where(primary_key => ids)
    end
    

    Original explanation can be found here: https://www.codehub.com.vn/Difference-between-find_in_batches-vs-in_batches-in-Ruby-on-Rails
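
To make the query counts above concrete, here is a minimal plain-Ruby sketch that models the load branch without Rails or a database. FakeTable, each_batch and the fetch callable are hypothetical names invented for this illustration. They are not Active Record APIs, and the sketch only approximates how batches are walked in primary-key order.

```ruby
# Toy in-memory stand-in for a database table; it only records which kind of
# query was issued. FakeTable is hypothetical and is not part of Active Record.
class FakeTable
  attr_reader :queries

  def initialize(rows)
    @rows = rows.sort_by { |row| row[:id] }
    @queries = []
  end

  # Models: SELECT * FROM posts WHERE id > after ORDER BY id LIMIT limit
  def rows_after(after, limit)
    @queries << :select_all
    @rows.select { |row| row[:id] > after }.first(limit)
  end

  # Models: SELECT id FROM posts WHERE id > after ORDER BY id LIMIT limit
  def ids_after(after, limit)
    @queries << :select_ids
    @rows.select { |row| row[:id] > after }.first(limit).map { |row| row[:id] }
  end

  # Models: SELECT * FROM posts WHERE id IN (ids)
  def rows_with_ids(ids)
    @queries << :select_all
    @rows.select { |row| ids.include?(row[:id]) }
  end
end

# Walks the table in primary-key order and yields one callable per batch.
# Calling it returns the batch's rows, standing in for iterating a relation.
def each_batch(table, batch_size:, load:)
  last_id = 0 # assumes positive integer ids
  loop do
    if load
      rows = table.rows_after(last_id, batch_size) # one query, rows kept in memory
      ids = rows.map { |row| row[:id] }
      fetch = -> { rows }
    else
      ids = table.ids_after(last_id, batch_size)   # first query: ids only
      fetch = -> { table.rows_with_ids(ids) }      # second query, only if called
    end
    break if ids.empty?

    yield fetch
    break if ids.size < batch_size # a short batch means the table is exhausted

    last_id = ids.last
  end
end

rows = (1..5).map { |id| { id: id, title: "Post #{id}" } }

eager = FakeTable.new(rows)
each_batch(eager, batch_size: 2, load: true) { |fetch| fetch.call }

lazy = FakeTable.new(rows)
each_batch(lazy, batch_size: 2, load: false) { |fetch| fetch.call }

puts "load: true  -> #{eager.queries.size} queries"
puts "load: false -> #{lazy.queries.size} queries"
```

With five rows and a batch size of 2 there are three batches, so this prints 3 queries for load: true and 6 for load: false. If the block never calls fetch, the load: false run issues only the three id queries, which is the case where yielding a relation pays off.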