
Multilingual text sorting in Perl, on Windows, using locale


I am building a piece of software for sorting book indexes in different languages. It uses Perl, and keys off of the locale. I am developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or by relying on locale, am I barking up the wrong tree? Bottom line, Windows is really where I need this to work, but I am more comfortable developing in my UNIX environment.


Solution

  • Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, then it is easy to use the Unicode::Collate module as a starting point.

    If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.

    Decoding into Unicode

    If you run in an all-UTF8 environment, this is easy, but if you are subject to the vicissitudes of random so-called “locales” (or even worse, the ugly things Microsoft calls “code pages”), then you might want to get the CPAN Encode::Locale module to help you out. For example:

     use Encode;
     use Encode::Locale;
    
     # use "locale" as an arg to encode/decode
     @ARGV = map { decode(locale =>  $_) } @ARGV;
    
     # or as a stream for binmode or open
     binmode $some_fh, ":encoding(locale)";
    
     binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
     binmode STDOUT, ":encoding(console_out)"  if -t STDOUT;
     binmode STDERR, ":encoding(console_out)"  if -t STDERR;
    

    (If it were me, I would just use ":utf8" for the output.)
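
    When the input encoding is known in advance (say, an index file exported on Windows in code page 1252), the same decode-first rule can be applied explicitly with the core Encode module. This is only a minimal sketch: the sample bytes and variable names are made up for illustration, and cp1252 is an assumption about the input, not something the question states.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Hypothetical raw bytes as they might arrive from a cp1252 file:
    # "café" with the é stored as the single byte 0xE9.
    my $bytes = "caf\xE9";

    # Decode once at the boundary; from here on $text holds characters, not bytes.
    my $text = decode('cp1252', $bytes);

    # Encode once on the way out, regardless of what the input encoding was.
    my $utf8 = encode('UTF-8', $text);

    printf "%d characters, %d UTF-8 bytes\n", length($text), length($utf8);
    # prints: 4 characters, 5 UTF-8 bytes
    ```

    Everything between the decode and the encode, including collation, then works on characters and never has to care which code page the data came from.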


    Standard Collation, plus locales and tailoring

    The point is, once you have everything decoded into internal Perl format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:

       use v5.14;
       use utf8;
       use Unicode::Collate;
       my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
       @exes = Unicode::Collate->new->sort(@exes);
       say "@exes";
    
       # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
    

    Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.

    use Unicode::Collate;
    my $collator = Unicode::Collate->new(
        upper_before_lower => 1,
        preprocess => sub {
            local $_ = shift;
            s/^ (?: The | An? ) \h+ //x;            # strip articles
            s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
            return $_;
        },
    );
    

    Now just use that object’s sort method to sort with.
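
    To see what the article stripping and zero-padding buy you, here is a small self-contained run. The titles are invented for illustration, and the constructor is repeated only so that the snippet runs on its own; treat it as a sketch rather than a finished indexer.

    ```perl
    use v5.14;
    use Unicode::Collate;

    my $collator = Unicode::Collate->new(
        upper_before_lower => 1,
        preprocess => sub {
            local $_ = shift;
            s/^ (?: The | An? ) \h+ //x;            # strip leading articles
            s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
            return $_;
        },
    );

    # Made-up titles, purely for illustration.
    my @titles = (
        "The Hobbit",
        "Catch 22",
        "A Tale of Two Cities",
        "Catch 9",
        "An American Tragedy",
    );

    # sort() compares the preprocessed keys but returns the original strings.
    my @sorted = $collator->sort(@titles);
    say for @sorted;

    # prints:
    #   An American Tragedy
    #   Catch 9
    #   Catch 22
    #   The Hobbit
    #   A Tale of Two Cities
    ```

    Note that "Catch 9" lands before "Catch 22" only because of the padding; without it the digits would be compared one at a time, and 2 sorts before 9.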

    Sometimes you need to turn the sort inside out. For example:

     my $collator = Unicode::Collate->new();
     for my $rec (@recs) {
         $rec->{NAME_key} = 
            $collator->getSortKey( $rec->{NAME} );
     }
     my @srecs = sort {
         $b->{AGE}       <=>  $a->{AGE}
                         ||
         $a->{NAME_key}  cmp  $b->{NAME_key}
     } @recs;
    

    You have to do that because you are sorting records on several fields at once. The binary sort key lets you use the plain cmp operator on data that has been through your chosen or custom collator object.
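
    The difference between comparing the raw strings and comparing their sort keys is easy to demonstrate. A minimal sketch, using a default collator and a few made-up words:

    ```perl
    use v5.14;
    use Unicode::Collate;

    my $collator = Unicode::Collate->new;

    my @words = qw(Zebra apple Mango);

    # Plain cmp compares code points, so every uppercase ASCII letter
    # sorts before every lowercase one.
    my @by_codepoint = sort { $a cmp $b } @words;

    # cmp on the precomputed binary sort keys reproduces the collator's ordering.
    my %key = map { $_ => $collator->getSortKey($_) } @words;
    my @by_key = sort { $key{$a} cmp $key{$b} } @words;

    say "code point order: @by_codepoint";
    say "sort key order:   @by_key";

    # prints:
    #   code point order: Mango Zebra apple
    #   sort key order:   apple Mango Zebra
    ```

    Computing each key once and caching it, as in the record example, also avoids recomputing keys on every comparison inside the sort block.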

    The full constructor for the collator object has all this for a formal syntax:

          $Collator = Unicode::Collate->new(
             UCA_Version => $UCA_Version,
             alternate => $alternate, # alias for 'variable'
             backwards => $levelNumber, # or \@levelNumbers
             entry => $element,
             hangul_terminator => $term_primary_weight,
             highestFFFF => $bool,
             identical => $bool,
             ignoreName => qr/$ignoreName/,
             ignoreChar => qr/$ignoreChar/,
             ignore_level2 => $bool,
             katakana_before_hiragana => $bool,
             level => $collationLevel,
             minimalFFFE => $bool,
             normalization  => $normalization_form,
             overrideCJK => \&overrideCJK,
             overrideHangul => \&overrideHangul,
             preprocess => \&preprocess,
             rearrange => \@charList,
             rewrite => \&rewrite,
             suppress => \@charList,
             table => $filename,
             undefName => qr/$undefName/,
             undefChar => qr/$undefChar/,
             upper_before_lower => $bool,
             variable => $variable,
          );
    

    But you usually don’t have to worry about most of those. In fact, if you want locale-specific tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $locale_name, where the name is an identifier such as "fr" or "de__phonebook".

     use Unicode::Collate::Locale;
     $coll = Unicode::Collate::Locale->
               new(locale => "fr");
     @french_text = $coll->sort(@french_text);
    

    See how easy that is?
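
    A tailoring is easiest to appreciate next to the untailored default. The sketch below uses the "es" locale and three made-up words; it relies on the Spanish tailoring treating ñ as a separate letter after n, whereas the default rules treat it as an n with an accent.

    ```perl
    use v5.14;
    use utf8;
    use Unicode::Collate;
    use Unicode::Collate::Locale;

    my @words = qw( ñu nube nota );

    # Default UCA order: ñ differs from n only at the accent level.
    my @default = Unicode::Collate->new->sort(@words);

    # Spanish tailoring: ñ is its own letter, sorted after n.
    my @spanish = Unicode::Collate::Locale->new(locale => "es")->sort(@words);

    binmode STDOUT, ":encoding(UTF-8)";
    say "default UCA: @default";
    say "Spanish:     @spanish";

    # prints:
    #   default UCA: nota ñu nube
    #   Spanish:     nota nube ñu
    ```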

    But you can do other cool things, too.

     use v5.14;   # for say
     use utf8;    # the source contains non-ASCII literals
     use Unicode::Collate::Locale;
     my $Collator = Unicode::Collate::Locale->new(
                     locale => "de__phonebook",
                     level  => 1,
                     normalization => undef,
                    );
    
     my $full = "Ich müß Perl studieren.";
     my $sub = "MUESS";
     if (my ($pos,$len) = $Collator->index($full, $sub)) {
         my $match = substr($full, $pos, $len);
         say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
    
     }
    

    When run, that says:

     Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›
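
    The level => 1 setting is what makes that search so forgiving: at the primary level only base letters count, so case and accents are ignored. A short sketch of the same idea using the collator's eq method, with a made-up pair of strings:

    ```perl
    use v5.14;
    use utf8;
    use Unicode::Collate;

    my $primary = Unicode::Collate->new(level => 1);  # base letters only
    my $full    = Unicode::Collate->new;              # all levels

    my $at_primary = $primary->eq("résumé", "RESUME") ? "equal" : "different";
    my $at_full    = $full->eq("résumé", "RESUME")    ? "equal" : "different";

    say "level 1:    $at_primary";
    say "all levels: $at_full";

    # prints:
    #   level 1:    equal
    #   all levels: different
    ```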
    

    Here are the available locales as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:

     locale name       description
    --------------------------------------------------------------
     af                Afrikaans
     ar                Arabic
     as                Assamese
     az                Azerbaijani (Azeri)
     be                Belarusian
     bg                Bulgarian
     bn                Bengali
     bs                Bosnian
     bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
     ca                Catalan
     cs                Czech
     cy                Welsh
     da                Danish
     de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
     ee                Ewe
     eo                Esperanto
     es                Spanish
     es__traditional   Spanish ('ch' and 'll' as a grapheme)
     et                Estonian
     fa                Persian
     fi                Finnish (v and w are primary equal)
     fi__phonebook     Finnish (v and w as separate characters)
     fil               Filipino
     fo                Faroese
     fr                French
     gu                Gujarati
     ha                Hausa
     haw               Hawaiian
     hi                Hindi
     hr                Croatian
     hu                Hungarian
     hy                Armenian
     ig                Igbo
     is                Icelandic
     ja                Japanese [1]
     kk                Kazakh
     kl                Kalaallisut
     kn                Kannada
     ko                Korean [2]
     kok               Konkani
     ln                Lingala
     lt                Lithuanian
     lv                Latvian
     mk                Macedonian
     ml                Malayalam
     mr                Marathi
     mt                Maltese
     nb                Norwegian Bokmal
     nn                Norwegian Nynorsk
     nso               Northern Sotho
     om                Oromo
     or                Oriya
     pa                Punjabi
     pl                Polish
     ro                Romanian
     ru                Russian
     sa                Sanskrit
     se                Northern Sami
     si                Sinhala
     si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
     sk                Slovak
     sl                Slovenian
     sq                Albanian
     sr                Serbian
     sr_Latn           Serbian in Latin (tailored as Croatian)
     sv                Swedish (v and w are primary equal)
     sv__reformed      Swedish (v and w as separate characters)
     ta                Tamil
     te                Telugu
     th                Thai
     tn                Tswana
     to                Tonga
     tr                Turkish
     uk                Ukrainian
     ur                Urdu
     vi                Vietnamese
     wae               Walser
     wo                Wolof
     yo                Yoruba
     zh                Chinese
     zh__big5han       Chinese (ideographs: big5 order)
     zh__gb2312han     Chinese (ideographs: GB-2312 order)
     zh__pinyin        Chinese (ideographs: pinyin order) [3]
     zh__stroke        Chinese (ideographs: stroke order) [3]
     zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
    
       Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
       it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
       (Zulu).
    
       Note
    
       [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and halfwidth forms are identical to their regular form.  The
       difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
       and then "katakana_before_hiragana" has no effect.
    
       [2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
       (level 2) greater than, the corresponding hangul syllable.
    
       [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.
    
       Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.
    

    So in summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that doesn’t rely on random settings of the user’s console window for correct behavior.


    Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by kind permission of its author. :)