Tags: ruby-on-rails, ruby, ruby-on-rails-3, encoding, chardet

How do I encode files to UTF-8 for Rails 3?


I've been working on Outlook imports (LinkedIn exports in Outlook format), but I'm having trouble with encoding. The Outlook-format CSV I get from exporting my LinkedIn contacts is not in UTF-8. Letters like ñ cause an exception in the mongoid_search gem when calling str.to_s.mb_chars.normalize. I think encoding is the issue, because the exception occurs when I call mb_chars (see first code example). I am not sure whether this is a bug in the gem, but I was advised to sanitize the data nonetheless.

I tried using File Picker's new, community-supported gem to upload the CSV data. I then tried three encoding detectors and transcoders:

  1. Ruby port of the Python library chardet
    • Didn't work as expected
    • The port still contained Python code, preventing it from running in my app
  2. rchardet19 gem
    • Detected iso-8859 with 0.8/1 confidence
    • Tried to transcode with Iconv, but it crashed on "illegal characters" at ñ
  3. Charlock_Holmes gem
    • Detected windows-1252 with 33/100 confidence
    • I assume that's the actual encoding, and that rchardet19 reported iso-8859 because windows-1252 is based on it
    • This gem uses ICU and has a maintained branch, "bundle-icu", which supports Heroku. When I try to transcode using Charlock, I get the error U_FILE_ACCESS_ERROR, an ICU error code meaning "could not open file"

Anybody know what to do here?


Solution

  • Ruby 1.9 has encoding support built in. Have you tried:

    s.force_encoding 'utf-8'
    

    mb_chars is an ActiveSupport wrapper meant for Ruby 1.8, so you shouldn't need it on 1.9.

    See the duplicate question: how to convert character encoding with ruby 1.9
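One caveat worth illustrating: force_encoding only changes the encoding label on a string and leaves the bytes untouched, so on its own it may not be enough for a file that is really windows-1252. A minimal sketch of the difference follows, assuming the export is either already UTF-8 or windows-1252 (as Charlock_Holmes guessed in the question). The helper name to_utf8 and the sample bytes are hypothetical, not part of the original answer or of any gem.

```ruby
# Hypothetical helper (not from the original answer): return a UTF-8 copy of raw bytes.
# Assumption: the input is either already UTF-8 or encoded in `fallback`.
def to_utf8(bytes, fallback = 'Windows-1252')
  as_utf8 = bytes.dup.force_encoding('UTF-8')
  return as_utf8 if as_utf8.valid_encoding?

  # Label the bytes with their real encoding, then convert them.
  # Bytes with no mapping become U+FFFD instead of raising an error.
  bytes.dup.force_encoding(fallback).encode('UTF-8', invalid: :replace, undef: :replace)
end

# "Muñoz" as a windows-1252 export would store it: ñ is the single byte 0xF1.
raw = [0x4D, 0x75, 0xF1, 0x6F, 0x7A].pack('C*')

relabelled = raw.dup.force_encoding('UTF-8') # relabels only; the bytes are unchanged
converted  = to_utf8(raw)                    # ñ becomes the two bytes 0xC3 0xB1

puts relabelled.valid_encoding? # => false
puts converted.valid_encoding?  # => true
```

If the detected encoding turns out to be something other than windows-1252, the same call works with a different second argument, for example to_utf8(raw, 'ISO-8859-1').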