ruby-on-rails ruby postgresql

Processing 500k records crashes the server. How to batch it?


I have a database table holding 500k records. For each record, I need to make an API call, fetch data from the API endpoint, and save the new data back to the database.

I tried to do it like this (simplified code):

task get_new_data: :environment do
  data = Model.where('column IS NOT NULL').order('id DESC')
  data.each do |d|
    puts '...info print...'
    api_call = API::Call(...)
    d.assign_attributes(attr1: api_call.data1,
                        attr2: api_call.data2,
                        ...)
    d.save!(validate: false)
    puts 'another info print...'
    sleep(0.5)
  end
end

When I ran this rake task on the server, I didn't see anything in the terminal window for the first 3-4 minutes. After that, the puts statements started appearing. A few minutes later (maybe 5), the terminal window froze. I could see in the database table that the records were being updated - so I thought it was only the terminal window that "got broken".

After another 15-20 minutes, the server's CPU and RAM went up to 100% usage and I was no longer able to ssh to the server (expected, given the 100% CPU/RAM usage). Similarly, the Rails app was unreachable - blue error screen (expected as well).

At this point, I was just waiting for the server to crash completely. After another 30 minutes or so, the CPU/RAM were released and the Rails app was reachable again - the rake task had finished. But it didn't finish completely: it processed "only" about 350k records, so I had to run the rake task again to process the remaining 150k records. That second run completed successfully.

The thing is, I'll need to run a similar task that iterates over the 500k records at least 6 more times, and I don't want to go through an experience like the one described above again.

What is the right way to process a dataset like this without killing the server? Should I batch the records into, say, 10k chunks? Or is there a better way to handle this? I currently don't use Sidekiq in the project.


Solution

  • Rails supports batching records, and it's easy to implement.

    Instead of

     data.each do |d|
    

    Use

     data.find_each do |d|
    

    The batch size is limited to 1,000 records by default, but you can override this with the :batch_size argument (see the sketch below).

    The official documentation is here:

    https://api.rubyonrails.org/classes/ActiveRecord/Batches.html
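
    For illustration, here is a minimal sketch of the question's rake task rewritten with find_each. The fetch_api_data helper is a hypothetical stand-in for the question's API call, and the attribute names are placeholders carried over from the question. One caveat worth knowing: find_each iterates in ascending primary-key order and ignores any order() clause, so the order('id DESC') from the original query has no effect.

     task get_new_data: :environment do
       # find_each walks the table keyed on the primary key and loads
       # records in batches, so only batch_size ActiveRecord objects are
       # held in memory at a time instead of all 500k.
       Model.where('column IS NOT NULL').find_each(batch_size: 10_000) do |d|
         response = fetch_api_data(d) # hypothetical helper wrapping the API call
         d.assign_attributes(attr1: response.data1,
                             attr2: response.data2)
         d.save!(validate: false)
         sleep(0.5)
       end
     end

    Because each batch can be garbage-collected before the next one is loaded, memory usage stays roughly flat regardless of the table size, which should avoid the 100% RAM situation described in the question.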