
Why can iconv convert the precomposed form but not the decomposed form of "É" (from UTF-8 to CP1252)?


I use the iconv library to interface between a modern input source that uses UTF-8 and a legacy system that uses CP1252 (a superset of ISO-8859-1, often loosely called Latin-1).

The interface recently failed to convert the French string "Éducation", where the "É" was encoded in decomposed form as hex 45 CC 81, that is, "E" (U+0045) followed by the combining acute accent (U+0301). Note that the destination encoding does have an "É" character, encoded as C9.

Why does iconv fail to convert that "É"? I checked that the iconv command-line tool that ships with Mac OS X 10.7.3 says it cannot convert, and that the Perl iconv module fails too.

This is all the more puzzling because the precomposed form of the "É" character (encoded as C3 89) converts just fine.

Is this a bug with iconv or did I miss something?

Note that I also have the same issue if I try to convert from UTF-16 (where "É" is encoded as 00 C9 composed or 00 45 03 01 decomposed).
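
For reference, the byte sequences quoted above can be reproduced with a minimal Python sketch. Python is used here only for illustration, and the helper name `hex_bytes` is hypothetical, not part of iconv or any library.

```python
import unicodedata

# Precomposed "É" is the single code point U+00C9.
composed = "\u00c9"
# NFD splits it into "E" (U+0045) followed by the combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", composed)


def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


print(hex_bytes(composed, "utf-8"))        # C3 89
print(hex_bytes(decomposed, "utf-8"))      # 45 CC 81
print(hex_bytes(composed, "utf-16-be"))    # 00 C9
print(hex_bytes(decomposed, "utf-16-be"))  # 00 45 03 01
```

Both strings render as the same glyph, but only the precomposed one maps directly to a single CP1252 byte.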


Solution

  • Unfortunately, iconv does not handle decomposed characters in UTF-8, except for the version installed on Mac OS X.

    When dealing with Mac file names, you can use iconv with the "utf8-mac" character set option. It also takes into account a few idiosyncrasies of the Mac decomposed form.

    However, non-Mac versions of iconv or libiconv don't support this, and I could not find the sources used on the Mac that provide this support.

    I agree with you that iconv should be able to deal with both the NFC and NFD forms of UTF-8, but until someone patches the sources, we have to detect decomposed input ourselves and normalize it before passing it to iconv.

    Faced with this annoying problem, I used Perl's Unicode::Normalize module, as suggested by Jukka, to recompose the input to NFC first:

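    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Encode qw/decode_utf8 encode_utf8/;
    use Unicode::Normalize;
    
    # Read UTF-8 from stdin or the files named on the command line,
    # recompose each line to NFC, and write it back out as UTF-8.
    while (<>) {
        print encode_utf8( NFC(decode_utf8 $_) );
    }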
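
The same normalize-first idea can be sketched in Python, with the CP1252 conversion done in the same step instead of piping to iconv. This is only an illustration under stated assumptions: the function name `utf8_to_cp1252` is hypothetical, and the encoding is done by Python's built-in `cp1252` codec, not by iconv.

```python
import unicodedata


def utf8_to_cp1252(data: bytes) -> bytes:
    """Decode UTF-8, recompose to NFC, then encode as CP1252.

    Hypothetical helper for illustration. It raises UnicodeEncodeError
    if a character has no CP1252 equivalent even after composition.
    """
    text = data.decode("utf-8")
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("cp1252")


# "Éducation" with a decomposed "É": 45 CC 81 followed by "ducation".
decomposed_input = b"\x45\xcc\x81ducation"
print(utf8_to_cp1252(decomposed_input))  # b'\xc9ducation'
```

Without the NFC step, encoding fails on the combining accent U+0301, which has no CP1252 code point of its own. That is the same situation iconv reports as a conversion error.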