Search code examples
rubycsvmemorylarge-filesenumerable

Choose starting row for CSV.foreach or similar method? Don't want to load file into memory


Edit (I adjusted the title): I am currently using CSV.foreach but that starts at the first row. I'd like to start reading a file at an arbitrary line without loading the file into memory. CSV.foreach works well for retrieving data at the beginning of a file but not for data I need towards the end of a file.

This answer is similar to what I am looking to do but it loads the entire file into memory; which is what I don't want to do.

I have a 10gb file and the key column is sorted in ascending order:

# example 10gb file rows
key,state,name
1,NY,Jessica
1,NY,Frank
1,NY,Matt
2,NM,Jesse
2,NM,Saul
2,NM,Walt
etc..

I find the line I want to start with this way ...

file = File.expand_path('~/path/10gb_file.csv')

File.open(file, 'rb').each do |line|
  if line[/^2,/]
    puts "#{$.}: #{line}" # 5: 2,NM,Jesse
    row_number = $. # 5
    break
  end
end

... and I'd like to take row_number and do something like this but not load the 10gb file into memory:

CSV.foreach(file, headers: true).drop(row_number) { |row| "..load data..." }

Lastly, I'm currently handling it like the next snippet; It works fine when the rows are towards the front of the file but not when they're near the end.

CSV.foreach(file, headers: true) do |row|
  next if row['key'].to_i < row_number.to_i
  break if row['key'].to_i > row_number.to_i

  "..load data..."
end

I am trying to use CSV.foreach but I'm open to suggestions. An alternative approach I am considering but does not seem to be efficient for numbers towards the middle of a file:

  • Use IO or File and read the file line by line
  • Get the header row and build the hash manually
  • Read the file from the bottom for numbers near the max key value

Solution

  • I think you have the right idea. Since you've said you're not worried about fields spanning multiple lines, you can seek to a certain line in the file using IO methods and start parsing there. Here's how you might do it:

    begin
      file = File.open(FILENAME)
    
      # Get the headers from the first line
      headers = CSV.parse_line(file.gets)
    
      # Seek in the file until we find a matching line
      match = "2,"
      while line = file.gets
        break if line.start_with?(match)
      end
    
      # Rewind the cursor to the beginning of the line
      file.seek(-line.size, IO::SEEK_CUR)
    
      csv = CSV.new(file, headers: headers)
    
      # ...do whatever you want...
    ensure
      # Don't forget the close the file
      file.close
    end
    

    The result of the above is that csv will be a CSV object whose first row is the row that starts with 2,.

    I benchmarked this with an 8MB (170k rows) CSV file (from Lahman's Baseball Database) and found that it was much, much faster than using CSV.foreach alone. For a record in the middle of the file it was about 110x faster, and for a record toward the end about 66x faster. If you want, you can take a look at the benchmark here: https://gist.github.com/jrunning/229f8c2348fee4ba1d88d0dffa58edb7

    Obviously 8MB is nothing like 10GB, so regardless this is going to take you a long time. But I'm pretty sure this will be quite a bit faster for you while also accomplishing your goal of not reading all of the data into the file at once.