regex perl unicode grapheme

What is the right way to get a grapheme?


Why does this print a U and not a Ü?

#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':utf8';
use charnames qw(:full);

my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

while ( $string =~ /(\X)/g ) {
        say $1;
}

# Output: U

Solution

  • Your code is correct.

    You really do need to play these things by the numbers; don’t trust what a "terminal" displays. Pipe the output through the uniquote program, probably with -x or -v, and look at the code points your program is really emitting.

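    Eyes deceive, and programs are even worse. Your terminal program is buggy, so it is lying to you: it receives both the U and the combining diaeresis but fails to render them as one character. Normalization shouldn’t matter, because \X matches a whole grapheme cluster whether the string is composed or decomposed.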

    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"'
    crème brûlée
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"' | uniquote -x
    cr\x{E8}me br\x{FB}l\x{E9}e
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' 
    crème brûlée
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' | uniquote -x
    cre\x{300}me bru\x{302}le\x{301}e
    
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' 
    éel̂urb em̀erc
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' | uniquote -x
    \x{E9}el\x{302}urb em\x{300}erc
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"'
    éel̂urb em̀erc
    $ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"' | uniquote -x
    e\x{301}el\x{302}urb em\x{300}erc
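
If uniquote is not installed, the same by-the-numbers check can be sketched in plain Perl. This is only an illustration, not part of the original answer: the helper names grapheme_codepoints and reverse_graphemes are hypothetical, and the script assumes the core Unicode::Normalize module is available. It prints the code points inside each \X match as ASCII, so the terminal's rendering cannot interfere, and it reverses by grapheme cluster so that combining marks stay with their base letters instead of drifting onto a neighbour as in the reversed output above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.014;
use utf8;
use charnames qw(:full);
use Unicode::Normalize qw(NFD NFC);

# Hypothetical helper: return one "U+XXXX ..." string per extended
# grapheme cluster, so the result is plain ASCII and does not depend
# on how a terminal draws combining characters.
sub grapheme_codepoints {
    my ($string) = @_;
    my @lines;
    for my $cluster ( $string =~ /\X/g ) {
        my @codes = map { sprintf 'U+%04X', ord($_) } split //, $cluster;
        push @lines, join ' ', @codes;
    }
    return @lines;
}

# Hypothetical helper: reverse by grapheme cluster rather than by code
# point, so each combining mark stays attached to its base character.
sub reverse_graphemes {
    my ($string) = @_;
    return join '', reverse $string =~ /\X/g;
}

# The string from the question: a single cluster made of two code points.
my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";
say for grapheme_codepoints($string);    # prints: U+0055 U+0308

# Reverse the decomposed string cluster by cluster, then recompose it.
my $reversed = NFC( reverse_graphemes( NFD("crème brûlée") ) );
say join ' ', grapheme_codepoints($reversed);
```

The first line of output shows that \X captured both the U and the combining diaeresis in one match, which is what the question's code already does. In the second line every accent should still sit on the letter it started on, unlike the plain scalar reverse of the NFD string.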