perl, utf-8, utf8-decode

Unicode Juggling with Perl


I have a problem I thought would be trivial. I have to deal with umlauts from the German alphabet (äöü). In Unicode, there are several ways to represent them, one of which uses combining characters. I need to normalise these different representations and replace them all with the single-character code point.

Such a deviant umlaut is easy to find: it is one of the letters a, o, or u, followed by the bytes \xCC\x88, the UTF-8 encoding of U+0308 COMBINING DIAERESIS. So I thought a regex would suffice.

This is my conversion function, which uses the Encode module.

# This sub can be extended to include more conversions
sub convert {
    local $_;
    $_ = shift;

    $_ = encode( "utf-8", $_ );

    s/u\xcc\x88/ü/g;
    s/a\xcc\x88/ä/g;
    s/o\xcc\x88/ö/g;
    s/U\xcc\x88/Ü/g;
    s/A\xcc\x88/Ä/g;
    s/O\xcc\x88/Ö/g;

    return $_;
}

But the umlaut that gets printed is an even more devious character (now taking 4 bytes) instead of the one on this list.

I guess the problem lies in the juggling between Perl's internal format, actual UTF-8, and the Encode module.

Even changing the substitution lines to

s/u\xcc\x88/\xc3\xbc/g;
s/a\xcc\x88/\xc3\xa4/g;
s/o\xcc\x88/\xc3\xb6/g;
s/U\xcc\x88/\xc3\x9c/g;
s/A\xcc\x88/\xc3\x84/g;
s/O\xcc\x88/\xc3\x96/g;

did not help: the umlauts are converted correctly, but they are then followed by "\xC2\xA4" in the bytes.

Any help?


Solution

  • You're doing it wrong: you must break the habit of manipulating characters at the representation level. Do not fiddle with bytes in a regex when you are dealing with text rather than binary data.

    The first step is to learn about the topic of encoding in Perl. You need this to understand the term "character strings" I am going to use in the following paragraph.

    When you have a character string, it might be in any of the various states of (de)composition. Use the module Unicode::Normalize to normalise a character string, and read the relevant chapters on equivalence and normalisation in the Unicode specification for the gory details; they are linked at the bottom of that module's documentation.

    I guess you want NFC, but you have to run a sanity check against your data to see whether that's really the intended result.

    use charnames qw(:full);
    use Unicode::Normalize qw(NFC);
    my $original_character_string = "In des Waldes tiefsten Gr\N{LATIN SMALL LETTER U WITH DIAERESIS}nden ist kein R\N{LATIN SMALL LETTER A}\N{COMBINING DIAERESIS}uber mehr zu finden.";
    my $modified_character_string = NFC($original_character_string);
    # "In des Waldes tiefsten Gr\x{fc}nden ist kein R\x{e4}uber mehr zu finden."
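
If the data arrives as UTF-8 encoded bytes, as in the question, the same idea can be wrapped in a decode, normalise, encode round trip. The following is only a minimal sketch under that assumption: the sub name convert_bytes and the sample input are made up for illustration, and NFC should still be checked against the real data. The last lines show one plausible cause of the 4-byte result, namely encoding bytes that were already UTF-8 a second time; whether that is what happened in the question is a guess.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFC);

# Hypothetical replacement for the question's convert():
# takes UTF-8 encoded bytes and returns UTF-8 encoded bytes in NFC.
sub convert_bytes {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> character string
    $chars = NFC($chars);                  # a + U+0308 becomes U+00E4, and so on
    return encode('UTF-8', $chars);        # character string -> bytes
}

# Sample input: "R", "a", U+0308 (as the bytes CC 88), "uber"
my $input  = "Ra\xcc\x88uber";
my $output = convert_bytes($input);

# Print the bytes as hex so the result does not depend on the terminal.
print join(' ', map { sprintf '%02x', ord($_) } split //, $output), "\n";
# prints: 52 c3 a4 75 62 65 72

# Possible cause of the 4-byte umlaut: the two bytes C3 BC (already UTF-8
# for U+00FC) are treated as two characters and encoded once more.
my $double = encode('UTF-8', "\xc3\xbc");
print length($double), "\n";   # prints: 4
```

Doing the decoding once on input and the encoding once on output keeps everything in between as character strings, so no substitution ever has to match raw bytes.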