
Why does Encode::decode with non-latin letter locales blow up on localised strftime output?


On Ubuntu with Perl 5.26.1 I have encountered the following problem when working on Dancer::Logger::Console. I've lifted this code out of Dancer2::Core::Role::Logger.

In order to run this, you need to generate the following locales:

sudo locale-gen de_DE.UTF-8
sudo locale-gen ko_KR.UTF-8

This example code uses the Korean locale, and fails without an error message. $@ is empty.

$ LC_ALL=ko_KR.UTF-8 perl -MPOSIX -MEncode -E 'eval {
    say Encode::decode("UTF-8", strftime("%b", localtime))
  }; 
  say $@;
  '
Wide character at -e line 1.

When run with a German locale, it succeeds (but throws a wide character warning, which we can ignore for this test).

$ LC_ALL=de_DE.UTF-8 perl -MPOSIX -MEncode -E 'eval {
    say Encode::decode("UTF-8", strftime("%b", localtime))
  }; 
  say $@;
  '
Wide character in say at -e line 2.
M�r

The %b format specifier produces the abbreviated month name as a localised word (see http://strftime.net/).

If we leave out the Encode::decode("UTF-8", ...) call, it works, and the Korean version above prints 3월.

What's going on here?


Solution

  • Under ko_KR.UTF-8, strftime("%b", localtime(1552997524)) returns 20.33.C6D4. When interpreted as Unicode Code Points, this is "␠3월" ("March", with a leading space).

    Under de_DE.UTF-8, strftime("%b", localtime(1552997524)) returns 4D.E4.72. When interpreted as Unicode Code Points, this is "Mär" (short form of "März", "March").

    So it seems that decoded text (a string of Unicode Code Points) is already being returned, which is perfect. There is nothing left to decode; all that remains is to encode the output.

    $ LC_ALL=ko_KR.UTF-8 perl -CSD -MPOSIX -e'CORE::say strftime("%b", localtime)'
     3월
    
    $ LC_ALL=de_DE.UTF-8 perl -CSD -MPOSIX -e'CORE::say strftime("%b", localtime)'
    Mär
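
    The failure from the question can be reproduced without installing any locales. The following is a minimal sketch that assumes strftime returned exactly the code points listed above; the two strings are written out by hand, and the helper name dump_cp is made up for illustration. It shows that Encode::decode expects bytes: handing it already-decoded text either dies (code points above 255) or quietly mangles the text (code points below 256).

    ```perl
    use strict;
    use warnings;
    use Encode ();

    # Hypothetical helper: dotted hex code points, e.g. "4D.E4.72".
    sub dump_cp { return sprintf '%vX', shift }

    # Hand-written stand-ins for what strftime("%b", ...) returned above.
    my $ko = "\x{20}3\x{C6D4}";    # Korean abbreviation, with leading space
    my $de = "M\x{E4}r";           # German abbreviation

    print dump_cp($ko), "\n";      # 20.33.C6D4
    print dump_cp($de), "\n";      # 4D.E4.72

    # Decoding text that contains a character above 255 dies inside Encode.
    my $ko_ok = eval { Encode::decode('UTF-8', $ko); 1 };
    print 'ko decode ', ($ko_ok ? "succeeded\n" : "died: $@");

    # Below 256 the characters are treated as bytes. A lone 0xE4 is not
    # valid UTF-8, so it is replaced by U+FFFD, the replacement character
    # seen in the question's German output.
    my $de_bad = Encode::decode('UTF-8', $de);
    print dump_cp($de_bad), "\n";  # 4D.FFFD.72

    # Encoding is the direction that output needs.
    print dump_cp(Encode::encode('UTF-8', $ko)), "\n";  # 20.33.EC.9B.94
    ```

    The exact wording of the error in $@ depends on the installed Encode version, so the sketch only checks that the call died.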
    

    In a program (as opposed to a one-liner), you could use something like the following instead of -CSD:

    use open ':std', ':encoding(UTF-8)';
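
    Put together, a complete script might look like the sketch below. It assumes, as the dumps above suggest, that strftime already returns decoded text on your Perl; the helper name month_abbr is made up for illustration.

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use open ':std', ':encoding(UTF-8)';   # encode text on the way out
    use POSIX qw(strftime);

    # Hypothetical helper: abbreviated month name for an epoch time, in
    # the locale taken from the environment (LC_ALL, LC_TIME, ...).
    sub month_abbr {
        my ($epoch) = @_;
        return strftime('%b', localtime($epoch));
    }

    # No Encode::decode here: the string is already text, and the output
    # layer set up by "use open" encodes it exactly once.
    print month_abbr(time), "\n";
    ```

    Run it as, for example, LC_ALL=ko_KR.UTF-8 perl script.pl, where script.pl is a placeholder file name.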