I want to perform batch operations on millions of records from the database.
According to the ActiveRecord documentation, there are two methods for performing batch operations, namely #find_in_batches & #in_batches. But I can't seem to find any difference between them except that one returns an Enumerator and the other an ActiveRecord Relation.
So, assuming they perform differently, I want to know which performs better in which scenario. And is there a better way to conditionally update millions of rows other than raw SQL?
In short, find_in_batches yields each batch of records that was found, while in_batches yields ActiveRecord::Relation objects.
So the following code:
Post.find_in_batches do |group|
  group.each { |post| puts post.title }
end
will send only one query per batch to the database, retrieving all of the posts' data for that batch:
SELECT "posts".* FROM "posts" WHERE ...
However:
Post.in_batches do |group|
  group.each { |post| puts post.title }
end
will send two queries per batch to the database. The first query gets the posts' ids for the batch:
SELECT "posts"."id" FROM "posts" WHERE ...
And the second query to get all posts' data for the batch:
SELECT "posts".* FROM "posts" WHERE ...
More details:
If you look into the source code for those two methods, you will see that find_in_batches actually calls in_batches with load: true passed as an argument. However, the default value of load in in_batches is false.
And if you look further into in_batches, at the part that uses the value of load, it looks like this:
if load
  records = batch_relation.records
  ids = records.map(&:id)
  yielded_relation = where(primary_key => ids)
  yielded_relation.load_records(records)
else
  ids = batch_relation.pluck(primary_key)
  yielded_relation = where(primary_key => ids)
end
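To make the effect of that branch concrete, here is a minimal toy model in plain Ruby. It is not ActiveRecord: FakeTable, Batch, fetch and the logged query strings are hypothetical names invented for illustration, and the sketch only assumes the load/pluck behaviour quoted above. It counts how many simulated queries each style issues when the block iterates over the records.

```ruby
# Toy model only: this is NOT ActiveRecord. FakeTable, Batch and the query
# strings are hypothetical and merely mimic the `load` branch quoted above.

# Stands in for the relation yielded to the block.
class Batch
  def initialize(table, ids, records)
    @table = table
    @ids = ids
    @records = records # nil when the batch was built from ids only
  end

  # Returns the records, fetching them lazily if they were not preloaded.
  def to_a
    @records ||= @table.fetch(@ids)
  end

  def each(&block)
    to_a.each(&block)
  end
end

class FakeTable
  attr_reader :queries

  def initialize(rows)
    @rows = rows   # array of hashes, each with an :id key
    @queries = []  # every simulated SQL statement is logged here
  end

  def in_batches(of: 2, load: false)
    @rows.each_slice(of) do |slice|
      if load
        # One query: full rows are loaded and attached to the batch.
        @queries << "SELECT * (batch)"
        yield Batch.new(self, slice.map { |r| r[:id] }, slice)
      else
        # One query for ids only; rows are fetched later if the block needs them.
        @queries << "SELECT id (pluck)"
        yield Batch.new(self, slice.map { |r| r[:id] }, nil)
      end
    end
  end

  # Like the real method, this delegates to in_batches with load: true.
  def find_in_batches(of: 2)
    in_batches(of: of, load: true) { |batch| yield batch.to_a }
  end

  # Second query, issued only when an unloaded batch is iterated.
  def fetch(ids)
    @queries << "SELECT * WHERE id IN (...)"
    @rows.select { |r| ids.include?(r[:id]) }
  end
end

rows = (1..5).map { |i| { id: i, title: "Post #{i}" } }

eager = FakeTable.new(rows)
eager.find_in_batches(of: 2) { |group| group.each { |post| post[:title] } }

lazy = FakeTable.new(rows)
lazy.in_batches(of: 2) { |group| group.each { |post| post[:title] } }

puts eager.queries.size # 3 (three batches, one query each)
puts lazy.queries.size  # 6 (three batches, two queries each)
```

In this sketch the second query only happens because the block iterates over the records. That suggests in_batches is the better fit when the block works on the relation itself (for example with update_all or delete_all), while find_in_batches is the better fit when each record has to be instantiated anyway.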
Original explanation can be found here: https://www.codehub.com.vn/Difference-between-find_in_batches-vs-in_batches-in-Ruby-on-Rails