Search code examples
ruby-on-railscsvscalabilitydelayed-jobruby-on-rails-3.2

Ruby-on-Rails 3.2: Export a CSV with a large data set (100,000 records)


Introduction

I have a app that has multiple tables, some with and some without associations.

Some tables will be required to hold around 100,000 entries.

The app is using Rails 3.2 on Ruby 1.9 and hosted on Heroku. I have access to workers if needed.

The Requirement in Question

A vital requirement for the app is to allow users to export the data as a CSV - a requirement for this is to allow the user to filter what data they want to export but I am not worrying about that at the moment as you will see from the data below, I have hard-coded what data is to be exported but this does rule out creating a rake task to just export the whole table.

Also the method implemented must be factored to allow to be used by multiple tables to avoid unnessaccery code repeating.

Current Solution

I am implementing delayed_job in my app and performing a CSV generate in a job. While doing this I am following the solution found here at http://www.ayokasystems.com/blog/delegating-long-running-jobs-in-rails/ from 'abdullah'.

The idea being is to generate the data in CSV format and save it in a LONGTEXT field in the UserJobs table to allow the user to download once complete and at a future time.

The Problem

The method used from the above tutorial works fine in my app until I run a job of 100,000 records at once. To overcome this I have tried to add the cool find_each function to the perform method but delayed job worker reports back a error everytime it tries to process it:

[Worker(host:*** pid:18637)] ReportJob failed with NoMethodError: undefined method `each' for #<Title:0x007ff20c1ec1b0> - 0 failed attempts
[Worker(host:*** pid:18637)] ReportJob failed with NoMethodError: undefined method `each' for #<Title:0x007ff20ec47f18> - 1 failed attempts
[Worker(host:*** pid:18637)] 2 jobs processed at 10.5219 j/s, 2 failed ... 

My code for the perform method is:

def perform
  Title.find_each do |titles|
    csv_data = CSV.generate do |csv|
      titles.each do |t|
        csv << t.to_csv
      end
    end
    user_job = UserJob.find(user_job_id)
    user_job.update_attribute :data, csv_data
  end
end

Can anyone see what the issue could be, I am thinking I have just made a stupid mistake with how I am looping through things.

I am very open to any other suggestions to how to accomplish the requirement in question but please bear in mind the limitations I have with Heroku.


Solution

  • You are trying to iterate with each but titles in this case are instances of title (not array).

    csv_vals = []
    columns = [:name, :release_date, :studio]
    
    Title.find_each(:select => columns) do |title| 
      columns.each {|value| csv_vals << "#{title[value]}"}
    end
    
    # comma separated string 
    csv_string = csv_vals.join(',')
    

    There is more elegant way to formulate CSV string but I am too lazy to experiment.

    What is important is that you are doing SELECT only on columns you need. For 100 000 records that adds to a lot of less bandwidth DB communication. With just find_each you get all columns for each row and you don't need them.