Tags: ruby-on-rails, ruby, csv, activerecord, activerecord-import

CSV Parse Taking Too Much Memory


I'm trying to read a 5MM line file, and right now it's exceeding my allotted memory usage on Heroku. My method is somewhat fast (~200 inserts/second), but I believe it's crashing on the import, so my plan was to import in batches of 1,000 or 10,000. My question is: how do I tell that I'm at the end of my file? Ruby has a .eof method, but it's a File method, and I'm not sure how to call it in my loop.

    def self.import_parts_db(file)
        time = Benchmark.measure do
            Part.transaction do 
                parts_db = []
                CSV.parse(File.read(file), headers: true) do |row|
                    row_hash = row.to_hash
                    part = Part.new(
                        part_num: row_hash["part_num"], 
                        description: row_hash["description"], 
                        manufacturer: row_hash["manufacturer"],
                        model: row_hash["model"],
                        cage_code: row_hash["cage_code"],
                        nsn: row_hash["nsn"]
                        )
                    parts_db << part
                end
                Part.import parts_db
            end
        end
        puts time
    end

Solution

    1st problem

    As soon as you use File.read(file) with a huge file, your script will use a lot of memory (possibly too much): the whole file is read into one huge string, even though CSV only needs it one line at a time.

    It might work fine for files with a few thousand rows, but for millions of lines you should use CSV.foreach, which reads the file line by line. Change

     CSV.parse(File.read(file), headers: true) do |row|
    

    to

    CSV.foreach(file, headers: true) do |row|
    

    In this example, memory usage went from 1GB to 0.5MB.
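
    If you want to see how much memory the process actually uses on your own data, a quick-and-dirty check is to print its resident set size before and after the import. This helper isn't part of the solution, just a rough measuring stick (it assumes the ps command is available, as on Linux/macOS and Heroku dynos):

    # Rough memory check: resident set size of the current process, in MB.
    # Relies on the `ps` command (Linux/macOS).
    def self.rss_mb
      `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
    end

    # puts rss_mb          # before
    # import_parts_db(file)
    # puts rss_mb          # after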

    2nd problem

    parts_db becomes a huge Array of Parts that keeps growing until the very end of the CSV file. You need to either drop the transaction and the accumulating array and import row by row (slow, but it never needs more memory than a single row; see the sketch just below), or process the CSV in batches.
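
    For reference, the first option could look roughly like this. It isn't from the answer above, just an illustration (the method name is arbitrary): it streams the file with CSV.foreach and inserts one row at a time.

    require 'csv'

    # Illustration of the "no transaction, no array" option: stream the CSV
    # and insert each row on its own. Memory stays flat, but every row is a
    # separate INSERT, so it is much slower than a batched import.
    def self.import_parts_db_row_by_row(file)
      CSV.foreach(file, headers: true) do |row|
        Part.create!(
          part_num: row['part_num'],
          description: row['description'],
          manufacturer: row['manufacturer'],
          model: row['model'],
          cage_code: row['cage_code'],
          nsn: row['nsn']
        )
      end
    end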

    Here's one way to process it in batches: we use CSV.parse again, but only on slices of 2,000 lines at a time:

    def self.import_parts_db(filename)
      time = Benchmark.measure do
        File.open(filename) do |file|
          headers = file.first # read the header line once
          # Stream the rest of the file lazily, 2000 lines at a time
          file.lazy.each_slice(2000) do |lines|
            Part.transaction do
              rows = CSV.parse(lines.join, write_headers: true, headers: headers)
              parts_db = rows.map do |row|
                Part.new(
                  part_num: row['part_num'],
                  description: row['description'],
                  manufacturer: row['manufacturer'],
                  model: row['model'],
                  cage_code: row['cage_code'],
                  nsn: row['nsn']
                )
              end
              Part.import parts_db
            end
          end
        end
      end
      puts time
    end
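
    If you trust the CSV data, activerecord-import can also skip the per-record ActiveRecord validations, which usually speeds up the inserts further. This isn't part of the method above, just an option worth knowing about:

    # Optional: skip model validations for each Part (only if the data is
    # already known to be clean).
    Part.import parts_db, validate: false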
    

    3rd problem?

    The previous version shouldn't use much memory, but it could still take a long time to import everything, possibly too long for a remote server.

    The advantage of using an Enumerator is that it's easy to skip batches, and get just the ones you want.

    Let's say your import takes too long and stops for some reason after 424,000 successful imports.

    You could then replace

    file.lazy.each_slice(2000) do |lines|
    

    with

    file.lazy.drop(424_000).take(300_000).each_slice(2000) do |lines|
    

    This skips the first 424,000 CSV lines and parses the next 300,000.

    For the next import, use:

    file.lazy.drop(424_000+300_000).take(300_000).each_slice(2000) do |lines|
    

    and then:

    file.lazy.drop(424_000+2*300_000).take(300_000).each_slice(2000) do |lines|
    

    ...
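
    Rather than editing the drop/take numbers by hand between runs, you could turn them into parameters. A rough sketch (the offset:/limit: keyword names and the 2,000-line slice size are just choices for this example, not part of the answer above):

    require 'csv'

    # Same idea as above, but resumable: skip `offset` CSV lines, then import
    # at most `limit` lines, in slices of 2000.
    def self.import_parts_db(filename, offset: 0, limit: nil)
      File.open(filename) do |file|
        headers = file.first
        lines = file.lazy.drop(offset)
        lines = lines.take(limit) if limit
        lines.each_slice(2000) do |slice|
          Part.transaction do
            rows = CSV.parse(slice.join, write_headers: true, headers: headers)
            Part.import rows.map { |row|
              Part.new(
                part_num: row['part_num'],
                description: row['description'],
                manufacturer: row['manufacturer'],
                model: row['model'],
                cage_code: row['cage_code'],
                nsn: row['nsn']
              )
            }
          end
        end
      end
    end

    # First run:  Part.import_parts_db('parts.csv', offset: 424_000, limit: 300_000)
    # Next run:   Part.import_parts_db('parts.csv', offset: 724_000, limit: 300_000)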