ruby encoding

Remove invalid bytes, keep valid UTF-8 (in Ruby 2)


(I posted a similar problem here, but this new question is not a duplicate).

The software I'm developing must run with either Ruby 2.6.10 or 1.9.3.

Small reproducible problem:

b = "L\xF6sé侍"

Here we have a string in which one byte is illegal in UTF-8 (the byte with hex value F6). From Ruby's viewpoint, the encoding of the string is Encoding::UTF_8. Looking at the byte sequence, we can see:

p b.bytes.to_a

=>

[76, 246, 115, 195, 169, 228, 190, 141]

My goal is to remove from the string all bytes that are illegal in UTF-8. In my simple example, I want to get a string with the content "Lsé侍".
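To make the problem visible before trying to fix it, here is a small sketch that rebuilds the example string from the byte list above and asks Ruby which "characters" are broken. The variable name bad is only for illustration.

```ruby
# encoding: utf-8
# Rebuild the example string from the byte list shown above.
b = [76, 246, 115, 195, 169, 228, 190, 141].pack('C*').force_encoding('UTF-8')

p b.valid_encoding?   # => false

# each_char hands back an invalid byte as its own one-byte string,
# so the offending bytes can be listed directly.
bad = b.each_char.reject(&:valid_encoding?)
p bad.map { |ch| ch.bytes.to_a }   # => [[246]]
```

This only diagnoses the string; it does not change it.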

I tried

c1 = b.encode('UTF-8', invalid: :replace, replace: '')

but c1 has the same content as b. Then I tried

b.force_encoding(Encoding::US_ASCII)
c2 = b.encode('UTF-8', invalid: :replace, replace: '')

but this also erases the characters é and 侍, since they are not valid in ASCII.
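One workaround that is often suggested for this situation is to transcode through a different Unicode encoding, so that the conversion really has to inspect every byte. The following is only a sketch under that assumption, using UTF-16BE as the intermediate encoding; c3 is just a name continuing the numbering above.

```ruby
# encoding: utf-8
b = [76, 246, 115, 195, 169, 228, 190, 141].pack('C*').force_encoding('UTF-8')

# Going from UTF-8 to UTF-16BE is a real conversion, so invalid: :replace
# gets a chance to drop the bad byte. The valid characters survive because
# UTF-16 can represent them.
utf16 = b.encode('UTF-16BE', invalid: :replace, replace: '')
c3 = utf16.encode('UTF-8')

p c3 == "Lsé侍"        # => true
p c3.valid_encoding?  # => true
```

The cost is two full conversions of the string, which may matter for large inputs.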

I was also thinking of putting together a hard-coded list of the byte values that are invalid in UTF-8 and simply deleting them from the string, but this is ugly.

Any ideas how this can be done?

UPDATE: The code I posted here was based on my experiments in irb, but it turned out that irb seems to behave a bit differently from non-interactive Ruby. You can find a screenshot here, based on the comment by user @mate. To make it work, I couldn't assign the string literally in my JRuby program (it would have been rejected at compile time already), so I read it from a file instead (which is what happens in our "real" application anyway).

Hence, if you want to reproduce the example, download the file with the erroneous text from this link and run the following Ruby script:

p RUBY_VERSION
str = File.read("./errf.txt")
p str.bytes.to_a
str2 = str.encode('UTF-8', invalid: :replace, replace: '')
p str2.bytes.to_a
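If the linked file is not at hand, the setup can be approximated locally. This is a minimal sketch, assuming the file holds the same eight bytes as the example string (the real errf.txt may differ). The temp file name is made up. Reading in binary mode and then calling force_encoding keeps the result independent of the default external encoding.

```ruby
# encoding: utf-8
require 'tempfile'

# Assumption: a stand-in file containing the eight bytes from the example.
bytes = [76, 246, 115, 195, 169, 228, 190, 141]
file = Tempfile.new('errf')
file.binmode
file.write(bytes.pack('C*'))
file.close

# Read the raw bytes, then label them as UTF-8 without transcoding.
str = File.open(file.path, 'rb') { |f| f.read }
str.force_encoding('UTF-8')

p str.bytes.to_a        # same eight byte values as written
p str.valid_encoding?   # => false

file.unlink
```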

Solution

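  • You could split the string into characters, keep the valid ones, and join them back into a string. each_char yields every invalid byte as its own one-byte string, so valid_encoding? filters out exactly those:

    "L\xF6sé侍".each_char.select(&:valid_encoding?).join
    # => "Lsé侍"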
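Since the code has to run on both an old and a newer Ruby, the approach above could be wrapped so that String#scrub (available from Ruby 2.1 onward) is used where it exists, with the each_char filter as the fallback. This is a sketch only; the helper name strip_invalid_utf8 is hypothetical and not part of the answer above.

```ruby
# encoding: utf-8
# Hypothetical helper: returns a copy of str with invalid UTF-8 bytes removed.
def strip_invalid_utf8(str)
  utf8 = str.dup.force_encoding('UTF-8')   # work on a copy, leave the caller's string alone
  if utf8.respond_to?(:scrub)
    utf8.scrub('')                         # Ruby 2.1+: replace each invalid sequence with ''
  else
    utf8.each_char.select(&:valid_encoding?).join   # older Rubies such as 1.9.3
  end
end

b = [76, 246, 115, 195, 169, 228, 190, 141].pack('C*').force_encoding('UTF-8')
cleaned = strip_invalid_utf8(b)
p cleaned == "Lsé侍"         # => true
p cleaned.valid_encoding?   # => true
```

Both branches should give the same result for the example string; which one runs depends only on the Ruby version.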