Tags: performance, crystal-lang

Removing duplicates from CSV -- performance problems


I have a CSV file which might look like this:

foo,bar,glib
"a","1","A"
"b","1","B"
"a","2","C"
"b","1","D"

I'm looping through that CSV file and want to remove any duplicate rows where foo and bar are the same, i.e. my resulting file should look like this:

foo,bar,glib
"a","1","A"
"b","1","B"
"a","2","C"

This is how I'm doing it:

require "csv"

File.open("input.csv") do |infile|
  reader = CSV.new(infile, headers: true)
  File.open("output.csv", "w") do |outfile|
    printed_tuples = Array(Tuple(String, String)).new
    CSV.build(outfile) do |writer|
      while reader.next
        next if printed_tuples.includes?({reader.row["foo"], reader.row["bar"]})
        printed_tuples << {reader.row["foo"], reader.row["bar"]}
        writer.row reader.row.to_a
      end
    end
  end
end

The real CSV file is a lot bigger (386280 lines and 17 columns) and this gets so slow to the point that it's practically unusable.

Ironically, I'm rewriting a Python script because I had hoped for better performance, but for now the Python version is much quicker.

Does anyone have any pointers on how to speed it up?


Solution

  • The critical operation is the membership test. Array#includes? is inefficient here: it has to scan linearly through all previously seen rows (for a duplicate it stops early, typically after about half of them). Doing that scan for every row makes the whole process O(N²).

    You need a data structure that can be searched faster. Crystal has a Set type, which is backed by a hash table, so membership tests take O(1) on average instead of O(N).

    There are probably even better data structures and algorithms for this problem, but Set is available in Crystal's standard library and should already improve things a lot.
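    A minimal sketch of that change, keeping the filenames and the foo/bar columns from the question. Set#add? inserts the element and returns false if it was already present, so the lookup and the insert collapse into a single hash operation:

    ```crystal
    require "csv"
    require "set"

    File.open("input.csv") do |infile|
      reader = CSV.new(infile, headers: true)
      File.open("output.csv", "w") do |outfile|
        # Set membership is an O(1) hash lookup instead of Array's O(N) scan.
        seen = Set(Tuple(String, String)).new
        CSV.build(outfile) do |writer|
          # Note: the original code never emitted the header row; write it
          # explicitly if the output should keep it.
          writer.row reader.headers
          while reader.next
            key = {reader.row["foo"], reader.row["bar"]}
            # add? returns false when the key was already in the set,
            # so duplicates are skipped without a separate includes? call.
            next unless seen.add?(key)
            writer.row reader.row.to_a
          end
        end
      end
    end
    ```

    With this change each row costs an expected O(1) lookup, so the whole pass is O(N) instead of O(N²).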