perl sorting unicode utf-8 locale

Why don’t word characters (\w) match right under the use locale pragma?


When I use locale, some characters from my locale (et_EE.UTF-8) are not matched by \w, and I don't see why.

In addition to ASCII, Estonian uses six more characters:

õäöüšž

In my test script below I put them in $string together with three additional special characters, ðŋц, which do not belong to the Estonian alphabet.

use feature 'say';
use POSIX qw( locale_h );

{
  use utf8;
  my  $string = "õäöüšž ðŋц";
  binmode STDOUT, ":encoding(UTF-8)";
  say "nothing";
  say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
  say 'UC: ', uc( $string );
  say 'SORT: ', sort( split(//, $string) );
  say $string =~ m/\w/g;
  say $string =~ m/\p{Word}/g;
  say '';
}

{
  use utf8;
  use locale;
  binmode STDOUT, ":encoding(UTF-8)";
  my  $string = "õäöüšž ðŋц";
  say "locale";
  say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
  say 'UC: ', uc( $string );
  say 'SORT: ', sort( split(//, $string) );
  say $string =~ m/\w/g;
  say $string =~ m/\p{Word}/g;
  say '';
}

{
  use utf8::all;
  my  $string = "õäöüšž ðŋц";
  say "utf8::all";
  say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
  say 'UC: ', uc( $string );
  say 'SORT: ', sort( split(//, $string) );
  say $string =~ m/\w/g;
  say $string =~ m/\p{Word}/g;
  say '';
}

{
  use utf8::all;
  use locale;
  my  $string = "õäöüšž ðŋц";
  say "utf8::all + locale";
  say 'LOCALE: ', setlocale(LC_CTYPE), ' ', setlocale(LC_COLLATE);
  say 'UC: ', uc( $string );
  say 'SORT: ', sort( split(//, $string) );
  say $string =~ m/\w/g;
  say $string =~ m/\p{Word}/g;
  say '';
}

I tried Perl 5.10.1 and 5.14.2, and both gave me this output:

nothing
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT:  äðõöüŋšžц
õäöüšžðŋц
õäöüšžðŋц

locale
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT:  ðŋšžõäöüц
šžŋц
õäöüšžðŋц

utf8::all
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT:  äðõöüŋšžц
õäöüšžðŋц
õäöüšžðŋц

utf8::all + locale
LOCALE: et_EE.UTF-8 et_EE.UTF-8
UC: ÕÄÖÜŠŽ ÐŊЦ
SORT:  ðŋšžõäöüц
šžŋц
õäöüšžðŋц

What did not behave as I expected?

  • Main problem: under use locale I expected \w to match all six of my characters, but the result šžŋц is quite weird. Why these matches? In perlrecharclass I read:

For code points above 255 ... \w matches the same as \p{Word} matches in this range. ... For code points below 256 ... if locale rules are in effect ... \w matches the platform's native underscore character plus whatever the locale considers to be alphanumeric.

So \w matches the characters above 255 but does not match "whatever the locale considers to be alphanumeric". Why? At the same time, sorting under locale works fine (and without locale it does not): the result ðŋšžõäöüц is in the right order, which shows that the characters are represented properly. As far as I understand, sort could not work correctly without knowing "whatever the locale considers to be alphanumeric". Or could it?

  • I thought that setlocale returns a result only under the locale pragma. How can I test which locale is in effect for a scope?
  • I did not expect all characters to be upper-cased in every test case. As far as I understand, uc and lc should be locale-dependent. In the first case I expected them all to stay lower-cased; with locale I expected the first six characters to be upper-cased and the others not. The only case where I expected all characters to be upper-cased was the third. I was missing something important here, and have now found it in the lc docs: "Otherwise, If EXPR has the UTF-8 flag set: Unicode semantics are used for the case change." The UTF-8 flag is always set on my $string, so this one got answered while I was writing it.
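
The boundary at code point 256 in the perlrecharclass quote can be checked directly against the test string. The following is a minimal sketch, not part of the original script; the helper name split_by_codepoint is made up for illustration. It only shows which characters fall on which side of the boundary, and it does not by itself explain why the locale fails to classify the lower ones as alphanumeric.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: split a string's non-space characters into those
# below code point 256 (where locale rules may apply to \w) and those at
# 256 or above (where \w is documented to behave like \p{Word}).
sub split_by_codepoint {
    my ($str) = @_;
    my (@low, @high);
    for my $ch (grep { !/\s/ } split //, $str) {
        if (ord($ch) < 256) { push @low,  $ch }
        else                { push @high, $ch }
    }
    return (join('', @low), join('', @high));
}

binmode STDOUT, ':encoding(UTF-8)';
my ($low, $high) = split_by_codepoint("õäöüšž ðŋц");
print "below 256:    $low\n";    # õäöüð
print "256 or above: $high\n";   # šžŋц
```

The second group is exactly the šžŋц that \w matched under use locale, so the characters that went missing are the ones handled by the locale branch of the rule.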

Using locale for sorting and \p{Word} for matching is acceptable to me, but I could still use some hints: why does \w not work as I expected?
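
For reference, the \p{Word} matching mentioned here can be isolated into a few lines. This is only a sketch under the same assumptions as the test script (source saved as UTF-8, use utf8 in effect); the helper name word_chars is hypothetical. It deliberately leaves the locale pragma out, since the output above shows \p{Word} giving the same result with and without it.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: return every character of $str that \p{Word} matches.
sub word_chars {
    my ($str) = @_;
    my @chars = $str =~ /\p{Word}/g;
    return @chars;
}

binmode STDOUT, ':encoding(UTF-8)';
my @matched = word_chars("õäöüšž ðŋц");
# All nine letters match; only the space is skipped.
print scalar(@matched), " word characters: ", join('', @matched), "\n";
```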


Solution

  • Please do not use the broken use locale pragma.

    Please, please, please use Unicode::Collate::Locale for locale collation. It uses the CLDR rules, and is completely portable and doesn’t rely on dodgy broken POSIX locales, which simply do not work well.

    If you sort by code point, you get nonsense, but if you sort using a Unicode::Collate::Locale object constructed with the Estonian locale, you get something reasonable:

    Codepoint sort:  äðõöüŋšžц
    Estonian  sort:  ðŋšžõäöüц
    

    Also, when you do this raw codepoint sort, you are terribly affected by normalization matters. Consider:

    NFC/NFD sort by codepoint is DIFFERENT
    NFC Codepoint sort:  äðõöüŋšžц
    NFD Codepoint sort:  äõöšüžðŋц
    
    NFC/NFD sort in estonian  is SAME
    NFC Estonian  sort:  ðŋšžõäöüц
    NFD Estonian  sort:  ðŋšžõäöüц
    

    And here is the demo program that produced all that.

    #!/usr/bin/env perl
    #
    # et-demo - show how to handle Estonian collation correctly
    #
    # Tom Christiansen <[email protected]>
    # Fri Feb 22 19:27:51 MST 2013
    
    use v5.14;
    use utf8;
    use strict;
    use warnings;
    use warnings FATAL => "utf8";
    use open qw(:std :utf8);
    
    use Unicode::Normalize;
    use Unicode::Collate::Locale;
    
    main();
    exit();
    
    sub graphemes(_) {
        my($str) = @_;
        my @graphs = $str =~ /\X/g;
        return @graphs;
    }
    
    sub same_diff($$) {
        my($s1, $s2) = @_;
        no locale;
    
        if (NFC($s1) eq NFC($s2)) {
            return "SAME";
        } else {
            return "DIFFERENT";
        }
    }
    
    sub stringy {
        return join("" => @_);
    }
    
    sub cp_sort {
        no locale;
        return sort @_;
    }
    
    sub et_sort {
        state $collator = # we want Estonian here:
            Unicode::Collate::Locale->new(locale => "et");
        return $collator->sort(@_);
    }
    
    sub main {
        my $orig = "õäöüšž ðŋц";
    
        say "    Codepoint sort: ", cp_sort(graphemes($orig));
        say "    Estonian  sort: ", et_sort(graphemes($orig));
    
        my $nfc = NFC($orig);
        my $nfc_cp_sort = stringy cp_sort(graphemes($nfc));
        my $nfc_et_sort = stringy et_sort(graphemes($nfc));
    
        my $nfd = NFD($orig);
        my $nfd_cp_sort = stringy cp_sort(graphemes($nfd));
        my $nfd_et_sort = stringy et_sort(graphemes($nfd));
    
        say "NFC/NFD sort by codepoint is ",
            same_diff($nfc_cp_sort, $nfd_cp_sort);
    
        say "NFC Codepoint sort: ", $nfc_cp_sort;
        say "NFD Codepoint sort: ", $nfd_cp_sort;
    
        say "NFC/NFD sort in estonian  is ",
            same_diff($nfc_et_sort, $nfd_et_sort);
    
        say "NFC Estonian  sort: ", $nfc_et_sort;
        say "NFD Estonian  sort: ", $nfd_et_sort;
    
    }
    

    That really is how you should be handling locale collation. See also this answer for numerous examples.
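
To see why the raw code point sort changes between NFC and NFD, it can help to look at a single letter in both forms. The snippet below is a small sketch using only Unicode::Normalize from the demo; the helper name codepoints is invented for illustration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: describe a string as a list of U+XXXX code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $letter = "ä";
print "NFC: ", codepoints(NFC($letter)), "\n";   # U+00E4
print "NFD: ", codepoints(NFD($letter)), "\n";   # U+0061 U+0308

# \X matches one extended grapheme cluster, so the two-code-point NFD form
# still counts as a single unit, as in the demo's graphemes() function.
my @nfd_graphemes = NFD($letter) =~ /\X/g;
print "graphemes in NFD form: ", scalar(@nfd_graphemes), "\n";   # 1
```

In NFD the grapheme starts with a plain "a", so a code point comparison files it among the ASCII letters, which is consistent with the NFD code point order shown above. The collator compares by collation rules rather than raw code points, which is why its result is the same for both forms.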