ruby-on-rails, ruby, csv, sidekiq

Optimizing reading from the database and writing to a CSV file


I'm trying to read a large number of cells from the database (over 100,000) and write them to a CSV file on an Ubuntu VPS. It turns out the server doesn't have enough memory.

I was thinking about reading 5000 rows at a time, writing them to the file, then reading the next 5000, and so on.
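Roughly what I have in mind (an untested sketch, reusing file_path and the columns from my current code below):

def write_rows_in_chunks(emails)
  File.open(file_path, "w+") do |f|
    f << "email,name,ip,created\n"
    offset = 0
    loop do
      # load 5000 records at a time instead of the whole table
      batch = emails.order(:id).limit(5000).offset(offset).to_a
      break if batch.empty?
      batch.each do |l|
        f << [l.email, l.name, l.ip, l.created_at].join(",") + "\n"
      end
      offset += 5000
    end
  end
end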

How should I restructure my current code so that memory isn't exhausted?

Here's my code:

def write_rows(emails)
  File.open(file_path, "w+") do |f|
    f << "email,name,ip,created\n"
    emails.each do |l|
      f << [l.email, l.name, l.ip, l.created_at].join(",") + "\n"
    end
  end
end

The method is called from a Sidekiq worker with:

write_rows(user.emails)
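For context, the worker looks roughly like this (simplified; the class name and the user lookup are just illustrative):

class ExportEmailsWorker
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id)
    write_rows(user.emails)
  end
end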

Thanks for the help!


Solution

  • The problem here is that when you call emails.each, ActiveRecord loads all of the records from the database and keeps them in memory. To avoid this, you can use the find_each method:

    require 'csv'
    
    def write_rows(emails)
      CSV.open(file_path, 'w') do |csv|
    
        csv << %w{email name ip created}
    
        emails.find_each do |email|
          csv << [email.email, email.name, email.ip, email.created_at]
        end
      end
    end
    

    By default, find_each loads records in batches of 1000 at a time. If you want to load batches of 5000 records, you have to pass the :batch_size option to find_each:

    emails.find_each(batch_size: 5000) do |email|
      ...
    end

    More information about the find_each method (and the related find_in_batches) can be found in the Ruby on Rails Guides; a sketch using find_in_batches is shown at the end of this answer.

    I've used the CSV class to write the file instead of joining fields and lines by hand. This is not intended as a performance optimization, since writing to the file shouldn't be the bottleneck here.
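    For completeness, here is a quick, untested sketch of the same export using find_in_batches, which yields whole batches as arrays instead of single records. That can be handy if you want to do per-batch work, such as logging progress:

    require 'csv'

    def write_rows(emails)
      CSV.open(file_path, 'w') do |csv|
        csv << %w{email name ip created}

        # each iteration receives an array of up to 5000 records
        emails.find_in_batches(batch_size: 5000) do |batch|
          batch.each do |email|
            csv << [email.email, email.name, email.ip, email.created_at]
          end
        end
      end
    end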