Tags: encoding, utf-8, character-encoding, iconv, cp1252

"Raw" conversion from double-UTF-8 to UTF-8 (or from UTF-8 to ANSI)


I am dealing with a legacy file that has been encoded twice using UTF-8. For example, the codepoint ε (U+03B5) should have been encoded as CE B5 but has instead been encoded as C3 8E C2 B5 (C3 8E is the UTF-8 encoding of U+00CE, C2 B5 is the UTF-8 encoding of U+00B5).

The second encoding was performed on the assumption that the data was encoded in CP-1252.
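To make the corruption concrete, here is a minimal Ruby sketch that reproduces it for ε. It assumes the second tool simply reinterpreted the UTF-8 bytes as CP-1252 and converted them to UTF-8 again; the variable names are only illustrative.

```ruby
# Sketch: reproduce the double encoding of ε (U+03B5).
hex = ->(s) { s.bytes.map { |b| format("%02X", b) }.join(" ") }

original = "\u03B5"                                       # correct UTF-8 string
misread  = original.dup.force_encoding(Encoding::CP1252)  # same bytes, read as CP-1252
double_encoded = misread.encode(Encoding::UTF_8)          # encoded to UTF-8 a second time

puts hex.call(original)        # CE B5
puts hex.call(double_encoded)  # C3 8E C2 B5
```

Undoing the damage means running the last two steps backwards: decode as UTF-8, encode as CP-1252, and treat the resulting bytes as UTF-8.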

To go back to the UTF-8 encoding I use the following (seemingly wrong) command:

iconv --from utf8 --to cp1252 <file.double-utf8 >file.utf8

My problem is that iconv seems unable to convert some characters back. More precisely, iconv is unable to convert characters whose UTF-8 representation contains a byte that maps to a control character in CP-1252. One example is the codepoint ρ (U+03C1):

  • its UTF-8 encoding is CF 81,
  • the first byte CF is re-encoded to C3 8F,
  • the second byte 81 is re-encoded to C2 81.

iconv refuses to convert C2 81 back to 81, probably because it does not know how to map that control character precisely.

echo -e -n '\xc3\x8f\xc2\x81' | iconv --from utf8 --to cp1252
�iconv: illegal input sequence at position 2
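The same failure can be reproduced outside iconv. The sketch below assumes Ruby's CP1252 table, which (like the strict mapping iconv appears to use here) has no entry for U+0081, so the reverse conversion of the double-encoded ρ stops at the second character.

```ruby
# Sketch: the strict UTF-8 -> CP1252 conversion fails on U+0081.
double_encoded = "\u00CF\u0081"   # bytes C3 8F C2 81, the double-encoded ρ

begin
  double_encoded.encode(Encoding::CP1252)
  outcome = :converted
rescue Encoding::UndefinedConversionError => e
  outcome = :undefined_conversion
  failed_char = e.error_char      # the character that has no CP1252 mapping
end

puts outcome
puts format("U+%04X", failed_char.ord) if failed_char
```

CP-1252 leaves five byte values unassigned (81, 8D, 8F, 90 and 9D), so any character whose UTF-8 encoding contains one of them is affected in the same way.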

How can I tell iconv to just perform the mathematical UTF-8 conversion without caring about the mappings?


Solution

  • The following code uses Ruby's low-level encoding functions to force the rewriting of double-encoded UTF-8 (via CP1252) into normal UTF-8.

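    #!/usr/bin/env ruby
    # Undo a UTF-8 -> (misread as CP1252) -> UTF-8 double encoding, byte by byte.
    
    ec = Encoding::Converter.new(Encoding::UTF_8, Encoding::CP1252)
    
    prev_b = nil
    
    orig_bytes = STDIN.read.force_encoding(Encoding::BINARY).bytes.to_a
    real_utf8_bytes = ""
    real_utf8_bytes.force_encoding(Encoding::BINARY)
    
    orig_bytes.each do |b|
        b = b.chr
    
        # Feed one byte at a time; PARTIAL_INPUT lets the converter wait for
        # the remaining bytes of a multi-byte sequence.
        situation = ec.primitive_convert(b.dup, real_utf8_bytes, nil, nil, Encoding::Converter::PARTIAL_INPUT)
    
        if situation == :undefined_conversion
                # Only C2 xx sequences (U+0080..U+00BF) can be passed through as the raw byte xx.
                # Compare against a binary string: b is binary, a plain "\xC2" literal is not.
                if prev_b != "\xC2".b
                        $stderr.puts "ERROR found byte #{b.dump} in stream (prev #{(prev_b||'').dump})"
                        exit 1
                end
    
                # Append the raw byte, then restore the encoding the converter expects.
                real_utf8_bytes.force_encoding(Encoding::BINARY)
                real_utf8_bytes << b
                real_utf8_bytes.force_encoding(Encoding::CP1252)
        end
    
        prev_b = b
    end
    
    real_utf8_bytes.force_encoding(Encoding::BINARY)
    # write, unlike puts, never appends a newline to the output.
    $stdout.write(real_utf8_bytes)
    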
    

    It is meant to be used in a pipeline:

    cat "$PROBLEMATIC_FILE" | ./fix-double-utf8-encoding.rb > "$CORRECTED_FILE"
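For input that fits in memory as a single string, the same idea can be sketched more compactly with `String#encode`, one character at a time. This is only an illustrative variant, not the answer's method: `undo_double_utf8` is a hypothetical helper name, and it assumes the input is valid UTF-8 and that every character without a CP1252 mapping lies in U+0080..U+00FF.

```ruby
# Hypothetical helper: convert each character to CP1252 where possible and
# fall back to the raw codepoint value for characters CP1252 cannot represent.
def undo_double_utf8(double_encoded)
  out = String.new  # empty binary (ASCII-8BIT) buffer
  double_encoded.each_char do |ch|
    begin
      out << ch.encode(Encoding::CP1252).b
    rescue Encoding::UndefinedConversionError
      # Anything above U+00FF cannot stand for a single byte: give up.
      raise if ch.ord > 0xFF
      out << [ch.ord].pack("C")   # e.g. U+0081 becomes the byte 81
    end
  end
  out.force_encoding(Encoding::UTF_8)
end

fixed = undo_double_utf8("\u00CF\u0081")                  # double-encoded ρ
puts fixed.bytes.map { |b| format("%02X", b) }.join(" ")  # CF 81
```

The raw fallback is what the question calls the "mathematical" conversion: the codepoint value is written out as a byte without consulting the CP1252 table.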