Edit (I adjusted the title): I am currently using CSV.foreach
but that starts at the first row. I'd like to start reading a file at an arbitrary line without loading the file into memory. CSV.foreach
works well for retrieving data at the beginning of a file but not for data I need towards the end of a file.
This answer is similar to what I am looking to do but it loads the entire file into memory; which is what I don't want to do.
I have a 10gb file and the key
column is sorted in ascending order:
# example 10gb file rows
key,state,name
1,NY,Jessica
1,NY,Frank
1,NY,Matt
2,NM,Jesse
2,NM,Saul
2,NM,Walt
etc..
I find the line I want to start with this way ...
file = File.expand_path('~/path/10gb_file.csv')
File.open(file, 'rb').each do |line|
if line[/^2,/]
puts "#{$.}: #{line}" # 5: 2,NM,Jesse
row_number = $. # 5
break
end
end
... and I'd like to take row_number
and do something like this but not load the 10gb file into memory:
CSV.foreach(file, headers: true).drop(row_number) { |row| "..load data..." }
Lastly, I'm currently handling it like the next snippet; It works fine when the rows are towards the front of the file but not when they're near the end.
CSV.foreach(file, headers: true) do |row|
next if row['key'].to_i < row_number.to_i
break if row['key'].to_i > row_number.to_i
"..load data..."
end
I am trying to use CSV.foreach
but I'm open to suggestions. An alternative approach I am considering but does not seem to be efficient for numbers towards the middle of a file:
IO
or File
and read the file line by linekey
valueI think you have the right idea. Since you've said you're not worried about fields spanning multiple lines, you can seek to a certain line in the file using IO methods and start parsing there. Here's how you might do it:
begin
file = File.open(FILENAME)
# Get the headers from the first line
headers = CSV.parse_line(file.gets)
# Seek in the file until we find a matching line
match = "2,"
while line = file.gets
break if line.start_with?(match)
end
# Rewind the cursor to the beginning of the line
file.seek(-line.size, IO::SEEK_CUR)
csv = CSV.new(file, headers: headers)
# ...do whatever you want...
ensure
# Don't forget the close the file
file.close
end
The result of the above is that csv
will be a CSV object whose first row is the row that starts with 2,
.
I benchmarked this with an 8MB (170k rows) CSV file (from Lahman's Baseball Database) and found that it was much, much faster than using CSV.foreach
alone. For a record in the middle of the file it was about 110x faster, and for a record toward the end about 66x faster. If you want, you can take a look at the benchmark here: https://gist.github.com/jrunning/229f8c2348fee4ba1d88d0dffa58edb7
Obviously 8MB is nothing like 10GB, so regardless this is going to take you a long time. But I'm pretty sure this will be quite a bit faster for you while also accomplishing your goal of not reading all of the data into the file at once.