Tags: ruby, utf-8, character-encoding, utf-16, iconv

Ruby 1.8 Iconv UTF-16 to UTF-8 fails with "\000" (Iconv::InvalidCharacter)


I am having trouble handling text files of tabulated data generated on a Windows machine. I'm working in Ruby 1.8. The following code raises an error ("\000" (Iconv::InvalidCharacter)) when processing the SECOND line of the file. The first line is converted properly.

require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets)
  line = conv.iconv(line.strip)  # FAILS HERE
  puts line
  # DO MORE STUFF HERE
end

The strange thing is that it reads and converts the first line in the file with no problem. I have the //IGNORE flag in the Iconv constructor -- I thought this was supposed to suppress this kind of error.

I've been going in circles for a while. Any advice would be highly appreciated.

Thanks!

EDIT: hobbs's solution fixes this. Thank you. Simply change the code to:

require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets("\x0a\x00"))
  line = conv.iconv(line.strip)  # NO LONGER FAILS HERE
  # DOES MORE STUFF HERE
end

Now I'll just need to find a way to automatically determine which gets separator to use.
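
One way this could be approached, assuming the files start with a byte-order mark (BOM), is to look at the first two bytes: FF FE marks UTF-16 little-endian and FE FF marks UTF-16 big-endian. The sketch below is only an illustration; `newline_separator_for` is a hypothetical helper name, and the fallback to "\n" when no BOM is found is an assumption, since a UTF-16 file without a BOM cannot be identified this way.

```ruby
require 'stringio'

# Hypothetical helper: pick a gets separator from the first two bytes of a file.
def newline_separator_for(first_bytes)
  case first_bytes.unpack('C2')
  when [0xFF, 0xFE] then "\x0a\x00"  # UTF-16LE BOM: newline is 0x0a 0x00
  when [0xFE, 0xFF] then "\x00\x0a"  # UTF-16BE BOM: newline is 0x00 0x0a
  else "\n"                          # assumption: no BOM means a single-byte newline
  end
end

# Stand-in for a real file: BOM followed by "a\n" in UTF-16LE.
sample = [0xFF, 0xFE, 0x61, 0x00, 0x0a, 0x00].pack('C*')
io = StringIO.new(sample)

separator = newline_separator_for(io.read(2).to_s)
io.rewind  # start again from the top so the BOM is still part of the first line

puts separator.unpack('C*').inspect  # prints [10, 0]
```

With a real file, the same two-byte peek would be done on the opened file before the gets loop, and the result passed as the argument to gets.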


Solution

  • The error message is pretty vague, but I think it's unhappy because it found an odd number of bytes on a line, whereas every character in UTF-16 is two (or occasionally four) bytes. And I think the reason for that is your use of gets: the lines in your file are separated by a UTF-16le newline, which is 0x0a 0x00, but gets is splitting on (and strip is removing) 0x0a only.

    To illustrate: suppose the file contains

    ab
    cd
    

    encoded in UTF-16le. That's

    0x61 0x00 0x62 0x00 0x0a 0x00 0x63 0x00 0x64 0x00 0x0a 0x00
        a         b         \n        c         d         \n
    

    gets reads up to the first 0x0a, which strip removes, so the first line read is 0x61 0x00 0x62 0x00, which iconv happily accepts and encodes to UTF-8 as 0x61 0x62, i.e. "ab". gets then reads up to the next 0x0a, which strip again removes, so the second time around line is assigned 0x00 0x63 0x00 0x64 0x00 and now everything is screwed up: we're out of sync by one byte and there's an odd number of bytes to convert, so iconv blows up because that's incompatible with what you asked it to do.

    Absent an actual working file encoding/decoding layer, I think what you want is to change the gets separator from "\n" ("\x0a") to "\x0a\x00", abandon all use of strip since it's not encoding-clean, and use print instead of puts so that you don't add extra line-ends (since you'll be converting the ones you've already got).

    If you're working with Windows files, a Windows CRLF in UTF-16le is "\x0d\x00\x0a\x00".
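
The byte-level walkthrough above can be checked with a small sketch that leaves Iconv out entirely and only counts bytes. It reads the same "ab\ncd\n" sample once with the default "\n" separator and once with "\x0a\x00". The helper names `read_chunks` and `byte_counts` are made up for this illustration, and `chomp("\n")` stands in for the part of strip that removes the trailing 0x0a.

```ruby
require 'stringio'

# "ab\ncd\n" encoded as UTF-16LE, spelled out byte by byte.
SAMPLE = [0x61, 0x00, 0x62, 0x00, 0x0a, 0x00,
          0x63, 0x00, 0x64, 0x00, 0x0a, 0x00].pack('C*')

# Hypothetical helper: collect every chunk gets returns for a given separator.
def read_chunks(data, separator)
  io = StringIO.new(data)
  chunks = []
  while (chunk = io.gets(separator))
    chunks << chunk
  end
  chunks
end

# Hypothetical helper: number of bytes in each chunk.
def byte_counts(chunks)
  chunks.map { |chunk| chunk.unpack('C*').length }
end

# Splitting on the lone 0x0a byte and dropping it, as in the question.
naive = read_chunks(SAMPLE, "\n").map { |chunk| chunk.chomp("\n") }

# Splitting on the full UTF-16LE newline and leaving the chunk untouched.
aligned = read_chunks(SAMPLE, "\x0a\x00")

puts byte_counts(naive).inspect    # prints [4, 5, 1]
puts byte_counts(aligned).inspect  # prints [6, 6]
```

The second naive chunk has 5 bytes and starts with the stray 0x00, which is the point where the conversion goes wrong. With the two-byte separator every chunk has an even length and still carries its own newline, which is why print rather than puts is the better fit after conversion.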