
How to "observe" a stream in Ruby's CSV module?


I am writing a class that takes a CSV file, transforms it, and then writes the new data out.

module Transformer
  class Base
    def initialize(file)
      @file = file
    end

    def original_data(&block)
      opts = { headers: true }
      CSV.open(file, 'rb', opts, &block)
    end

    def transformer
      # complex manipulations here like modifying columns, picking only certain
      # columns to put into new_data, etc but simplified to `+10` to keep
      # example concise
      -> { |row| new_data << row['some_header'] + 10 }
    end

    def transformed_data
      self.original_data(self.transformer) 
    end

    def write_new_data
      CSV.open('new_file.csv', 'wb', opts) do |new_data|
        transformed_data
      end
    end
  end
end

What I'd like to be able to do is:

  • Look at the transformed data without writing it out (so I can test that it transforms the data correctly, and I don't need to write it to file right away: maybe I want to do more manipulation before writing it out)
  • Don't slurp all the file at once, so it works no matter the size of the original data
  • Have this as a base class with an empty transformer so that instances only need to implement their own transformers but the behavior for reading and writing is given by the base class.

But obviously the above doesn't work because I don't really have a reference to new_data in transformer.

How could I achieve this elegantly?


Solution

  • I can recommend one of the following approaches, depending on your needs and personal taste.

    I have intentionally distilled the code to just its bare minimum (without your wrapping class), for clarity.

    1. Simple read-modify-write loop

    Since you do not want to slurp the file, use CSV.foreach, which reads the source one row at a time. For example, for a quick debugging session, do:

    CSV.foreach "source.csv", headers: true do |row|
      row["name"] = row["name"].upcase
      row["new column"] = "new value"
      p row
    end
    

    And if you wish to write to file during that same iteration:

    require 'csv'
    
    csv_options = { headers: true }
    
    # Open the target file for writing
    CSV.open("target.csv", "wb") do |target|
      # Add a header
      target << %w[new header column names]
    
      # Iterate over the source CSV rows
      CSV.foreach "source.csv", **csv_options do |row|
        # Mutate and add columns
        row["name"] = row["name"].upcase
        row["new column"] = "new value"
    
        # Push the new row to the target file
        target << row
      end
    end
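
    If you want to look at the transformed rows without writing anything (for example in a test), one option is to wrap the iteration in a lazy enumerator. The following is a minimal sketch, not part of the code above: `transformed_rows` is a hypothetical helper name, and a StringIO with made-up sample data stands in for the source file so the example is self-contained.

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: returns a lazy enumerator of transformed rows.
# Nothing is written anywhere, and rows are transformed only on demand.
def transformed_rows(io)
  CSV.new(io, headers: true).each.lazy.map do |row|
    row["name"] = row["name"].upcase
    row["new column"] = "new value"
    row
  end
end

# Sample data (an assumption for illustration) in place of source.csv
source = StringIO.new("name,email\nalice,a@example.com\nbob,b@example.com\n")
rows = transformed_rows(source)

# Pull only the first transformed row; later rows are not transformed yet.
first = rows.first
p first["name"]
p first.headers
```

    With a real file, `CSV.foreach("source.csv", headers: true)` called without a block also returns an enumerator, so `.lazy.map` can be chained onto it in the same way.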
    

    2. Using CSV::Converters

    There is built-in functionality that might be helpful: CSV::Converters (see the :converters option in the CSV.new documentation). A converter is applied to each field as it is read.

    require 'csv'
    
    # Define a custom converter and register it under the name :stripper
    CSV::Converters[:stripper] = lambda do |value, field|
      value ? value.to_s.strip : value
    end
    
    # Reference the converter by name in the options hash
    csv_options = { headers: true, converters: [:stripper] }
    
    CSV.open("target.csv", "wb") do |target|
      # same as above
    
      CSV.foreach "source.csv", **csv_options do |row|
        # same as above - input data will already be converted
        # you can do additional things here if needed
      end
    end
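
    To see what a converter does to the data without touching any files, it can be applied to an in-memory string. This is a small sketch under assumptions: the sample data is made up, and the lambda takes a single argument (the field value) rather than the two-argument form used above.

```ruby
require 'csv'

# Custom converter, same idea as :stripper above: trim surrounding whitespace.
CSV::Converters[:stripper] = lambda do |value|
  value.is_a?(String) ? value.strip : value
end

# Sample data (an assumption for illustration) with padded fields
data = "name,city\n  alice  ,  Paris \n"

table = CSV.parse(data, headers: true, converters: [:stripper])
row = table[0]

p row["name"]
p row["city"]
```

    Converters only see one field at a time, so they suit per-value cleanup; adding or removing whole columns still has to happen in the row loop.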
    

    3. Separate input and output from your converter classes

    Based on your comment, and since you want to minimize I/O and iterations, it might be worth moving the read/write operations out of the transformers entirely, so that each transformer only receives a row and modifies it. Something like this:

    require 'csv'
    
    class NameCapitalizer
      def self.call(row)
        row["name"] = row["name"].upcase
      end
    end
    
    class EmailRemover
      def self.call(row)
        row.delete 'email'
      end
    end
    
    csv_options = { headers: true }
    converters = [NameCapitalizer, EmailRemover]
    
    CSV.open("target.csv", "wb") do |target|
      CSV.foreach "source.csv", **csv_options do |row|
        converters.each { |c| c.call row }
        target << row
      end
    end
    

    Note that the above code does not write a header row to the output, and the headers may have been changed by the transformations. You will probably have to take the #headers of a transformed row (after all transformations) and prepend them to the output CSV.
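
    One possible way to handle the header while still streaming is to write the #headers of the first transformed row before writing that row. The sketch below assumes every row ends up with the same headers after transformation; the lambdas and the StringIO objects are stand-ins chosen for illustration, not part of the code above.

```ruby
require 'csv'
require 'stringio'

# Illustrative transformers, in the same spirit as the classes above
upcase_name  = ->(row) { row["name"] = row["name"].upcase }
remove_email = ->(row) { row.delete("email") }
converters = [upcase_name, remove_email]

# In-memory stand-ins for source.csv and target.csv (sample data is made up)
source = StringIO.new("name,email\nalice,a@example.com\nbob,b@example.com\n")
output = StringIO.new

target = CSV.new(output)
header_written = false

CSV.new(source, headers: true).each do |row|
  converters.each { |c| c.call(row) }

  # Write the header once, taken from the first transformed row
  unless header_written
    target << row.headers
    header_written = true
  end

  target << row
end

puts output.string
```

    Because the header comes from a row that has already been transformed, added or deleted columns are reflected in it without a second pass over the file.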

    There are probably plenty of other ways to do it, but the CSV class in Ruby does not have the cleanest interface, so I try to keep code that deals with it as simple as I can.