Search code examples
rubyshellcsvfastercsv

Using Ruby to replace specific records in a CSV file with records from another CSV file


I have two large-ish CSV files. One file is just a list of records. The other file is a list of records, but the first column is the line number of the record that it modifies in the other file. It doesn't replace the whole row; It just replaces the value in the row that has the matching header.

For example:

File 1:

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Perl"
"Dane","Joe","Lisp"

File 2:

"Seq","Lang"
2,"Ruby"

The goal is to end up with one file that looks like this:

"First","Last","Lang"
"John","Doe","Ruby"
"Jane","Doe","Ruby"
"Dane","Joe","Lisp"

The data is, however, much more complicated than that and could even contain line breaks in the CSV. Thus, I can't rely on the line number and instead I have to rely on the record count. (Unless, of course, I preprocess both files to replace newlines and carriage returns.. which I suppose is possible but less interesting.)

The question I have is how do I loop through both files and do the proper replacement without loading either of the entire files into memory. I believe 100mb+ files loaded into memory is a bad idea, right?

Also, the records in the resulting file should be in the same order when it's done.


Solution

  • You will need 2 enumerators, but since they're not nested, one will need to use Enumerator#next, which means you need to be careful about it raising a EOF exception:

    e = CSV.open('file2.csv', :headers => true).each
    seq = e.next
    
    output = CSV.open('output.csv', 'w')
    
    csv = CSV.open('file1.csv')
    csv.each do |row|
      if seq['Seq'].to_i == csv.lineno - 1
        row[2] = seq['Lang']
        seq = e.next rescue ({'Seq' => -1})
      end
      output << row
    end