perl awk posix locale standards-compliance

How to make uc() work in Perl like toupper() does in AWK in a locale-aware POSIX environment?


When I use functions like toupper() in AWK, they are automatically locale-aware and process text in the user's current locale.

I would like to do the same in a Perl script, but have failed so far.

To test this, I wrote the following ASCII-only shell script, which exercises both Perl and AWK:

$ unexpand -t 2 << 'END_SCRIPT' | tee case3 && chmod +x case3
#! /bin/sh
{
  iconv -cf UTF-7 \
  | case $1 in
  awk)
    awk '{
      print "original", $0
      print "to lower", tolower($0)
      print "to upper", toupper($0)
    }'
    ;;
  perl)
    perl -e '
      use locale;
      while (defined($_= <>)) {
        print "original ", $_;
        print "to lower ", lc;
        print "to upper ", uc;
      }
    '
  esac \
  | iconv -ct UTF-7 | iconv -cf UTF-7
} << 'EOF'
+AMQ-gypten
S+APw-d
+APY-stlich
EOF
END_SCRIPT

Note the iconv round trip through UTF-7 at the end of the script: it is only there to drop any characters from the output that the current locale cannot represent.

Here is the output when I run the script for testing AWK:

$ ./case3 awk
original Ägypten
to lower ägypten
to upper ÄGYPTEN
original Süd
to lower süd
to upper SÜD
original östlich
to lower östlich
to upper ÖSTLICH

This looks correct and is what I expect.

Now the same for Perl:

$ ./case3 perl
original Ägypten
to lower gypten
to upper ÄGYPTEN
original Süd
to lower sd
to upper SüD
original östlich
to lower stlich
to upper öSTLICH

Obviously, this produces different output, and it is wrong: the lowercased lines lose their umlauts entirely, and the uppercased lines leave ü and ö in lower case.

I would appreciate knowing what I did wrong in the "perl" case of the script.

Note: I do not want my script to require a UTF-8 locale; it should work with any locale that can represent the German umlauts used in my test.txt file.

In case you are curious, the results above were generated with the following locale settings:

$ locale
LANG=de_AT.UTF-8
LANGUAGE=de_AT.UTF-8:de.UTF-8:en_US.UTF-8:de_AT:de:en_US:en
LC_CTYPE="de_AT.UTF-8"
LC_NUMERIC="de_AT.UTF-8"
LC_TIME="de_AT.UTF-8"
LC_COLLATE="de_AT.UTF-8"
LC_MONETARY="de_AT.UTF-8"
LC_MESSAGES="de_AT.UTF-8"
LC_PAPER="de_AT.UTF-8"
LC_NAME="de_AT.UTF-8"
LC_ADDRESS="de_AT.UTF-8"
LC_TELEPHONE="de_AT.UTF-8"
LC_MEASUREMENT="de_AT.UTF-8"
LC_IDENTIFICATION="de_AT.UTF-8"
LC_ALL=

Solution

  • This is not quite what you asked for, since it determines casing by Unicode rules rather than the locale's rules, but it will work for all locales (UTF-8 and otherwise):

    use open ':std', ':locale';
    while (<>) {
        print "original ", $_;
        print "to lower ", lc;
        print "to upper ", uc;
    }