I have a database table holding 500k records. I need to load each record, call an API endpoint to fetch new data, and save that new data back to the database.
I tried to do it like this (simplified code):
task get_new_data: :environment do
  data = Model.where('column IS NOT NULL').order('id DESC')
  data.each do |d|
    puts '...info print...'
    api_call = API::Call(...)
    d.assign_attributes(attr1: api_call.data1,
                        attr2: api_call.data2,
                        ...)
    d.save!(validate: false)
    puts 'another info print...'
    sleep(0.5)
  end
end
So what happened when I ran this rake task on the server: for the first 3-4 minutes, I didn't see anything in the terminal window. After that, I started seeing the puts output. Within another few minutes (maybe 5), the terminal window froze. I could see in the database table that records were still being updated, so I assumed it was only the terminal window that "got broken".
After another 15-20 minutes, the server's CPU and RAM usage went up to 100% and I was no longer able to ssh to the server (expected, given the 100% usage). Likewise, the Rails app was unreachable - blue error screen (expected as well).
At this point, I was just waiting for the server to fully crash. After another 30 minutes or so, the CPU and RAM were released and the Rails app was reachable again - the rake task had finished. But it didn't finish completely: it processed "only" about 350k records, so I had to run the rake task again for the remaining 150k. That second run completed successfully.
The thing is, I'll need to run similar tasks that iterate over all 500k records at least 6 more times, and I don't want to repeat the experience described above.
What is the right way to process datasets like this without killing the server? Should I batch the records in, say, 10k chunks? Or is there a better way to handle this? I currently don't use Sidekiq in the project.
Rails supports batching records out of the box, and it's easy to implement. The reason your server's RAM filled up is that data.each loads and instantiates all 500k ActiveRecord objects in memory at once before iterating.
Instead of
data.each do |d|
Use
data.find_each do |d|
The batch size defaults to 1,000 records, but you can override this with a :batch_size argument. Note that find_each ignores your order('id DESC') clause (logging a warning) and iterates by primary key in ascending order; on Rails 6.1+ you can pass order: :desc if the direction matters.
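Applied to your task, it could look something like this sketch (keeping the Model, column, and attribute placeholders from your question, with the API call elided as in the original):

task get_new_data: :environment do
  # find_each walks the table by primary key, loading and yielding
  # only batch_size records at a time instead of all 500k at once
  Model.where('column IS NOT NULL').find_each(batch_size: 10_000) do |d|
    api_call = API::Call(...)
    d.assign_attributes(attr1: api_call.data1,
                        attr2: api_call.data2,
                        ...)
    d.save!(validate: false)
    sleep(0.5)
  end
end

Memory use stays roughly flat because references to one batch are dropped before the next batch is fetched.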
Official documentation is here...
https://api.rubyonrails.org/classes/ActiveRecord/Batches.html
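If a future pass can compute the new values in SQL rather than via a per-record API call, in_batches (from the same ActiveRecord::Batches module) is worth a look: it yields whole relations instead of individual records, so you can issue one UPDATE per chunk. A minimal sketch, assuming a hypothetical bulk update:

Model.where('column IS NOT NULL').in_batches(of: 10_000) do |batch|
  # one UPDATE statement per 10k rows; only applicable when the new
  # value doesn't depend on a per-record API response
  batch.update_all(attr1: 'new value')
  sleep(0.5) # throttle between chunks, as in the original task
end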