ruby csv encoding smartercsv

SmarterCSV and file encoding issues in Ruby


I'm working with a file that appears to have UTF-16LE encoding. If I run

File.read(file, :encoding => 'utf-16le')

the first line of the file is:

"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n"

If I read the file using something like

csv_text = File.read(file, :encoding => 'utf-16le')

I get an error stating

ASCII incompatible encoding needs binmode (ArgumentError)

If I switch the encoding in the above to

csv_text = File.read(file, :encoding => 'utf-8')

I make it to the SmarterCSV section of the code, but get an error that states

`=~': invalid byte sequence in UTF-8 (ArgumentError)

The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:

require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
  csv_text = File.read(file, :encoding => 'utf-16le')
  File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
  puts 'made it here'
  SmarterCSV.process('/tmp/tmp_file', {
    :col_sep => "\t",
    :force_simple_split => true,
    :headers_in_file => false,
    :user_provided_headers => headers
   }).each do |row|
    converted_row = {}
    converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
    converted_row[:timestamp] = row[:timestamp_of_message]
    converted_row[:sender] = row[:sender][2..-2]
    converted_row[:phone_number] = row[:phone_number][2..-2]
    converted_row[:message] = row[:message][1..-2]
    converted_row[:room] = file.gsub(path, '')
  end
end

Update - 05/13/15

Ultimately, I decided to convert the file contents to UTF-8 rather than dive deeper into the SmarterCSV code. The first problem in SmarterCSV is that it does not let the user specify binary mode when reading a file. After I adjusted the source to handle that, a myriad of other encoding-related issues popped up, many of them related to how various parameters are handled for files that are not UTF-8 encoded. It may have been the easy way out, but converting everything to UTF-8 before feeding it into SmarterCSV solved my issue.
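A minimal sketch of that conversion step, assuming the input really is UTF-16LE with an optional leading BOM. The helper name transcode_to_utf8 and its arguments are hypothetical; they are not part of SmarterCSV or of the original script.

```ruby
# Hypothetical helper (not from the original post): reads a UTF-16LE file
# as raw bytes, transcodes it to UTF-8, drops a leading BOM if present,
# and writes the result to dest_path.
def transcode_to_utf8(src_path, dest_path)
  raw  = File.binread(src_path)                    # raw bytes, no text-mode encoding checks
  text = raw.force_encoding(Encoding::UTF_16LE)    # label the bytes with their real encoding
  utf8 = text.encode(Encoding::UTF_8)              # actual transcoding happens here
  utf8 = utf8.sub(/\A\uFEFF/, '')                  # strip the byte order mark, if any
  File.binwrite(dest_path, utf8)                   # write the UTF-8 bytes unchanged
  dest_path
end
```

The path returned by the helper could then be handed to SmarterCSV.process in place of '/tmp/tmp_file', so the gem only ever sees UTF-8 input.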


Solution

  • Add binmode to the File.read call.

    File.read(file, :encoding => 'utf-16le', mode: "rb")
    

    "b": Binary file mode. Suppresses EOL <-> CRLF conversion on Windows, and sets external encoding to ASCII-8BIT unless explicitly specified.

    ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read

    Now pass the correct encoding to SmarterCSV:

    SmarterCSV.process('/tmp/tmp_file', {
    :file_encoding => "utf-16le", ...
    

    Update

    It turned out that SmarterCSV does not support binary mode. After the OP tried to modify its source without success, the simpler solution was to convert the input to UTF-8, which SmarterCSV supports.
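To see what the binmode suggestion changes on its own, here is a small sketch that compares the two File.read calls from the question and the answer. The helper names are hypothetical. Whether the text-mode call raises may depend on the Ruby process's default encodings, so the sketch captures the error message rather than assuming it.

```ruby
# Hypothetical helper: the text-mode read from the question.
# Returns the error message if Ruby rejects it, or nil if the read succeeds.
def text_mode_read_error(path)
  File.read(path, :encoding => 'utf-16le')
  nil
rescue ArgumentError => e
  e.message
end

# Hypothetical helper: the binary-mode read suggested in the answer.
# The returned string is tagged as UTF-16LE and still starts with the BOM.
def binary_mode_read(path)
  File.read(path, :encoding => 'utf-16le', mode: 'rb')
end

# Usage (path is a placeholder):
#   puts text_mode_read_error('example.CSV').inspect
#   text = binary_mode_read('example.CSV')
#   puts text.encoding
#   puts text.encode('utf-8').inspect
```

The string returned by the binary-mode read still has to be converted with encode('utf-8') before it is written out for SmarterCSV, which is the route the OP ended up taking.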