Search code examples
rubycsvcharacter-encodingnon-ascii-characters

Desperately trying to remove this diabolical excel generated special character from csv in ruby


My computer has no idea what this character is. It came from Excel.

In excel it was a weird space, now it is literally represented by several symbols viz. my computer has no idea what it is.

This character is represented by a Ê in Excel (in csv, as xls it is a space of some kind), OS X's TextEdit treats it as a big space this long "            ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse it using normal utf-8, and I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns it into an "K". This K appears in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc) in my CSV, and cannot be removed via gsub(/K/) (groups of K, /KKKKKKKKK/, etc, cannot be removed either)! I've also used the opensource tool CSVfix, but its "removing leading and trailing spaces" command did not have an effect on the Ks.

I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like

sed: 1: "output.csv": invalid command code o

when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.


Solution

  • The answer to this problem is

    A.) this is a very difficult problem. no one so far knows how to "physically" remove the cyrillic Ks.

    but

    B.) csv files are just strings separated by unescaped commas, so matching strings using regular expressions works just find so long as the encoding doesn't break the program.

    So to read the file

    f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
    parsed = CSV.parse(f)
    

    then find specific rows via regular expression literal string matching (it will overlook the cyrillic K's)

    parsed.each do |p|           #here, p[0] is the metatag column
      @specific_metatag_row = parsed.index if p[0] =~ /MetatagA/
    end