ruby, encoding, utf-8, iconv

Equivalent of Iconv.conv("UTF-8//IGNORE",...) in Ruby 1.9.X?


I'm reading data from a remote source, and occasionally get some characters in another encoding. They're not important.

I'd like to get a "best guess" UTF-8 string and ignore the invalid data.

Main goal is to get a string I can use, and not run into errors such as:

  • Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
  • invalid byte sequence in UTF-8
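
For reference, here is a minimal sketch of how the two failure modes can arise. The sample bytes and variable names are made up for illustration: a valid two-byte "é" followed by a stray \xFF. The first error comes from transcoding a string tagged ASCII-8BIT; the second typically shows up later, when some operation (a regex match, for example) touches a string that is tagged UTF-8 but contains invalid bytes.

```ruby
# Hypothetical sample data: valid UTF-8 ("café ") followed by one invalid byte.
raw = "caf\xC3\xA9 \xFF"

# Case 1: the string is tagged as binary (ASCII-8BIT). Transcoding it to UTF-8
# fails on the first high byte, because binary bytes have no defined mapping.
binary = raw.dup.force_encoding("ASCII-8BIT")
first_error = nil
begin
  binary.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  first_error = e
end

# Case 2: the string is tagged UTF-8 but is not valid UTF-8. Nothing raises
# yet; the problem only surfaces when a later operation inspects the bytes.
tagged = raw.dup.force_encoding("UTF-8")
tagged_is_valid = tagged.valid_encoding?
```

Checking valid_encoding? up front is a cheap way to find out which strings need cleaning before anything else touches them.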

Solution

  • I thought this was it:

    string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

    will replace all unknowns with '?'.

    To ignore all unknowns, use :replace => '':

    string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

    Edit:

    I'm not sure this is reliable. I've gone into paranoid mode, and have been using:

    string.encode("UTF-8", ...).force_encoding('UTF-8')

    Script seems to be running OK now. But I'm pretty sure I'd gotten errors with this earlier.

    Edit 2:

    Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.
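
One plausible explanation for the intermittent errors: the Ruby 1.9 documentation for String#encode notes that converting a string to the encoding it is already tagged with is a no-op, so a string already tagged UTF-8 passes through with its invalid bytes intact. A commonly used workaround is to round-trip through a different Unicode encoding, which forces the bytes to actually be checked. The sketch below assumes that approach; the helper name to_clean_utf8 is hypothetical, and the choice of UTF-16LE as the intermediate encoding is an assumption rather than anything required.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of raw with invalid bytes dropped.
def to_clean_utf8(raw)
  # Work on a copy and treat the bytes as UTF-8, whatever tag they arrived with.
  s = raw.dup.force_encoding("UTF-8")
  return s if s.valid_encoding?

  # Round-trip through UTF-16LE so the conversion is never a same-encoding
  # no-op. Invalid or unmappable bytes are replaced with the empty string.
  s.encode("UTF-16LE", :invalid => :replace, :undef => :replace, :replace => "")
   .encode("UTF-8")
end

# Hypothetical sample input: valid "café " followed by a stray \xFF byte.
cleaned = to_clean_utf8("caf\xC3\xA9 \xFFok")
```

Unlike encoding straight from ASCII-8BIT, this keeps valid multi-byte characters such as "é" and drops only the bytes that are not valid UTF-8. On Ruby 2.1 and later, String#scrub covers the same need directly.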