I am having trouble handling text files of tabulated data generated on a Windows machine. I'm working in Ruby 1.8. The following raises an error ("\000" (Iconv::InvalidCharacter)) when processing the second line of the file. The first line is converted properly.
require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets)
line = conv.iconv(line.strip) # FAILS HERE
puts line
# DO MORE STUFF HERE
end
The strange thing is that it reads and converts the first line in the file with no problem. I have the //IGNORE flag in the Iconv constructor -- I thought this was supposed to suppress this kind of error.
I've been going in circles for a while. Any advice would be highly appreciated.
Thanks!
EDIT: hobbs's solution fixes this. Thank you. Simply change the code to:
require 'iconv'
conv = Iconv.new("UTF-8//IGNORE","UTF-16")
infile = File.open(tabfile, "r")
while (line = infile.gets("\x0a\x00"))
line = conv.iconv(line.strip) # NO LONGER FAILS HERE
# DOES MORE STUFF HERE
end
Now I'll just need to find a way to automatically determine which gets separator to use.
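One possible way to pick the separator automatically is to look at the byte-order mark, assuming the files actually start with one (that is not guaranteed). This is only a minimal sketch; `utf16_line_separator` is a hypothetical helper name, not something provided by Ruby or Iconv:

```ruby
# Hypothetical helper: choose a gets separator from the first two bytes of
# the file (the byte-order mark). Returns nil when no UTF-16 BOM is found,
# so the caller has to decide on a fallback.
def utf16_line_separator(first_bytes)
  case first_bytes.to_s.unpack("C2")
  when [0xFF, 0xFE] then "\x0a\x00"  # UTF-16LE: low byte comes first
  when [0xFE, 0xFF] then "\x00\x0a"  # UTF-16BE: high byte comes first
  else nil                           # no BOM, byte order unknown
  end
end
```

The argument would be the first two bytes read in binary mode, for example `File.open(tabfile, "rb") { |f| f.read(2) }`, and the result would be passed to `infile.gets`.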
The error message is pretty vague, but I think it's unhappy about the fact that it's found an odd number of bytes on a line, since every character in UTF-16 is two (or occasionally four) bytes. And I think the reason for that is your use of `gets`: the lines in your file are separated by a UTF-16le newline, which is `0x0a 0x00`, but `gets` is splitting on (and `strip` is removing) `0x0a` only.
To illustrate: suppose the file contains
ab
cd
encoded in UTF-16le. That's
0x61 0x00 0x62 0x00 0x0a 0x00 0x63 0x00 0x64 0x00 0x0a 0x00
a         b         \n        c         d         \n
`gets` reads up to the first `0x0a`, which `strip` removes, so the first line read is `0x61 0x00 0x62 0x00`, which iconv happily accepts and encodes to UTF-8 as `0x61 0x62`, i.e. "ab". `gets` then reads up to the next `0x0a`, which `strip` again removes, so the second time `line` gets `0x00 0x63 0x00 0x64 0x00` and now everything is screwed up: we're out of sync by one byte and there's an odd number of bytes to convert, and `iconv` blows up because that's incompatible with what you asked it to do.
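The misalignment can be reproduced without a file or Iconv. The following is just an illustrative sketch: it uses `StringIO` in place of the real file, `chomp` to stand in for dropping the separator, and `chunk_sizes` is a made-up helper name:

```ruby
require 'stringio'

# "ab\ncd\n" in UTF-16LE, spelled out byte by byte. Every byte here is ASCII,
# so String#size equals the byte count.
data = "a\x00b\x00\x0a\x00c\x00d\x00\x0a\x00"

# Hypothetical helper: read chunks with the given gets separator, drop the
# separator from each chunk, and record how many bytes are left.
def chunk_sizes(data, separator)
  io = StringIO.new(data)
  sizes = []
  while (line = io.gets(separator))
    sizes << line.chomp(separator).size
  end
  sizes
end

default_sizes = chunk_sizes(data, "\x0a")      # split on the lone 0x0a byte
utf16_sizes   = chunk_sizes(data, "\x0a\x00")  # split on the full UTF-16LE newline

p default_sizes  # [4, 5, 1] -- the second chunk has an odd byte count
p utf16_sizes    # [4, 4]    -- every chunk is a whole number of 2-byte units
```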
Absent an actual working file encoding/decoding layer, I think what you want is to change the `gets` separator from `"\n"` (`"\x0a"`) to `"\x0a\x00"`, abandon all use of `strip` since it's not encoding-clean, and use `print` instead of `puts` so that you don't add extra line-ends (since you'll be converting the ones you've already got).
If you're working with Windows files, a Windows CRLF in UTF-16le is `"\x0d\x00\x0a\x00"`.
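If you do need the line ending removed before conversion, one encoding-clean alternative to `strip` is to cut off whole UTF-16le code units only. This is a rough sketch under a few assumptions: `chomp_utf16le` is a hypothetical name, the line was read with the `"\x0a\x00"` separator so it is 2-byte aligned, and on newer Rubies the file was opened in binary mode so the string is treated as raw bytes:

```ruby
# Hypothetical helper: remove one trailing UTF-16LE line ending (CRLF or LF)
# from a 2-byte-aligned line, leaving all other bytes untouched.
def chomp_utf16le(line)
  endings = ["\x0d\x00\x0a\x00", "\x0a\x00"]  # try CRLF first, then bare LF
  endings.each do |ending|
    n = ending.size
    if line.size >= n && line[-n, n] == ending
      return line[0, line.size - n]
    end
  end
  line  # no recognised line ending, return the line unchanged
end
```

Unlike `strip`, this never removes a lone `0x00` or `0x0a` byte, so the result keeps an even number of bytes.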