I am reading files into Ruby strings, and these strings are processed further later on (for instance, by the CSV module). The external encoding of the files is passed in as a parameter, and the files to be processed are supposed to be in that encoding.
During reading, I convert the files from the supposed external encoding into UTF-8.
Occasionally, I get erroneous files which are encoded differently from the specified encoding.
Of course, if the encoding is wrong, my program will read only garbage; but if the file additionally contains byte sequences which are illegal under the supposed encoding, I get an exception when processing the file.
The specification requires that byte sequences which cannot be deciphered because of the incorrect encoding are simply removed from the input instead of aborting the program.
To implement this, I am reading a file into a string like this:
UTF8_CONVERTER = ->(field) { field.encode('utf-8', invalid: :replace, undef: :replace, replace: "") }

read_flags = {
  external_encoding: ext_enc,           # e.g. Encoding::ISO_8859_1
  internal_encoding: Encoding::UTF_8,
  converters:        UTF8_CONVERTER
}

file_content = IO.read(file_path, read_flags)
IMO, this should make file_content a valid string that is UTF-8 encoded. If my program later decides that this string should be parsed as CSV, it invokes the CSV parser like this:
e_enc = file_content.encoding
i_enc = Encoding::UTF_8
...
csv_opt = { col_sep: ';', row_sep: :auto, external_encoding: e_enc, internal_encoding: i_enc }
CSV.foreach(file_content, csv_opt) { .... }
The reason I redundantly specify the encoding here too is that the method processing the CSV is general-purpose and should also work when strings have a different encoding.
However, this does not work:
If I am processing a file which is supposed to be UTF-8 (i.e. ext_enc equals Encoding::UTF_8), but which in reality was encoded in, for instance, Windows-1252, and there are byte sequences in it which are illegal under UTF-8, CSV.foreach raises the exception ArgumentError: invalid byte sequence in UTF-8.
I conclude from this that my UTF8_CONVERTER did not remove the invalid bytes.
Can anybody see what I'm doing wrong here?
UPDATE
@Stefan pointed out in his comment that the converters option can't be used with IO.read, and suggested that I pass the conversion options directly. This does not work either (I have to use JRuby 1.7.21, which is equivalent to Ruby 1.9.3). I could at least create a small, reproducible example:
I create a file illegal.txt with the following content:
> xxd illegal.txt
00000000: 66fc 720a f.r.
We can see that it contains the byte sequence FC 72, which is not legal UTF-8.
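For anyone who wants to reproduce this without a hex editor: the file can also be created directly from Ruby. A minimal sketch using File.binwrite (available since Ruby 1.9.3):

# Write the raw bytes 66 FC 72 0A; 0xFC can never start a valid UTF-8 sequence.
File.binwrite('illegal.txt', "f\xFCr\n".force_encoding('ASCII-8BIT'))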
Now I read the file with:

fc = IO.read('illegal.txt',
             external_encoding: Encoding::UTF_8,
             internal_encoding: Encoding::UTF_8,
             invalid: :replace, undef: :replace, replace: "")
I would have expected this to remove at least the FC, so that the resulting string would be "fr\n", or maybe just "f\n". However, when I do a

puts fc.bytes.to_a

I still see [102, 252, 114, 10] printed.
ANSWER
The reason nothing gets replaced is that Ruby performs no transcoding at all when the external and internal encodings are identical; the bytes are passed through unchanged, so the :invalid, :undef and :replace options are never applied.
When reading the file via IO.read, you therefore have to specify an external encoding that differs from the internal one, e.g. ASCII as the external and UTF-8 as the internal encoding, in order to replace invalid or undefined byte sequences. (I'm using '_' as the replacement string for demonstration purposes here; the empty string works as well.)
data = IO.read('illegal.txt', encoding: 'ASCII:UTF-8',
                              undef: :replace,
                              invalid: :replace,
                              replace: '_')
data #=> "f_r\n"
data.bytes #=> [102, 95, 114, 10]
data.encoding #=> #<Encoding:UTF-8>
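Once the string is valid UTF-8 like this, it can be handed to CSV without triggering the ArgumentError. Note that CSV.foreach expects a file path; for parsing an in-memory string such as file_content, CSV.parse is the method to use. A minimal sketch:

require 'csv'

# data is valid UTF-8 now, so CSV no longer raises
# "ArgumentError: invalid byte sequence in UTF-8".
CSV.parse(data, col_sep: ';') do |row|
  p row  # e.g. ["f_r"]
end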
Alternatively, you can binread the file:
data = IO.binread('illegal.txt')
data #=> "f\xFCr\n"
data.bytes #=> [102, 252, 114, 10]
data.encoding #=> #<Encoding:ASCII-8BIT>
... and encode! the string afterwards:
data.encode!('utf-8', undef: :replace, invalid: :replace, replace: '_')
data #=> "f_r\n"
data.bytes #=> [102, 95, 114, 10]
data.encoding #=> #<Encoding:UTF-8>
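Finally, a note for the original situation in the question, where the string is already tagged as UTF-8: on Ruby 1.9 / JRuby 1.7, encoding a string into the encoding it is already tagged with is a no-op, so the :invalid and :replace options are silently ignored. A commonly used workaround is to round-trip through another encoding such as UTF-16; a sketch:

str = "f\xFCr\n"  # tagged as UTF-8, but contains the invalid byte 0xFC

# Converting to UTF-16 forces a real transcoding pass, during which the
# invalid byte is replaced; converting back yields clean UTF-8.
clean = str.encode('UTF-16', invalid: :replace, undef: :replace, replace: '')
           .encode('UTF-8')

clean #=> "fr\n"
clean.bytes #=> [102, 114, 10]

On Ruby 2.1 and later, String#scrub('') achieves the same in one step, but that is not available on JRuby 1.7.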