How to solve Perl's `length 'für' == 4` for `LC_CTYPE="en_US.UTF-8"`?


While implementing a line-wrapping algorithm, I realized that Perl's length function returns unexpected results if the input contains umlauts. For example, the three-character string "für" has a length of four:

> perl -e 'print length "für" == 4, "\n"'
1

In detail (in Perl debugger):

  DB<7> x length "für"
0  4
  DB<8> foreach (0..3)  { print substr('für', $_), "\n" }
für
ür
▒r
r

  DB<9> q

The obvious question is: How can I make Perl interpret the string as a UTF-8 string?

The manual suggests it should do so automatically.

perluniintro(1) states in "Perl's Unicode Model":

Perl supports both pre-5.6 strings of eight-bit native bytes, and strings of Unicode characters. The general principle is that Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be avoided, the data is transparently upgraded to Unicode. Prior to Perl v5.14.0, the upgrade was not completely transparent (see "The "Unicode Bug"" in perlunicode), and for backwards compatibility, full transparency is not gained unless "use feature 'unicode_strings'" (see feature) or "use 5.012" (or higher) is selected.

The code where I found the issue explicitly uses `require 5.018_000;`, so shouldn't Perl "do the right thing" automatically?

(I found an older question of mine, "printf aligning problem with degree (°) character", with a similar problem, but it was about printing strings, and I don't see how its answers apply here.)

The Bytes

Since I was asked for the bytes in the string, here is what I got:

> perl -e "print 'für'" | od -x
0000000 c366 72bc
0000004
> perl -e "print unpack('H*', 'für');" -e 'print "\n"'
66c3bc72
> perl -e "print 'für'" | hexdump -C
00000000  66 c3 bc 72                                       |f..r|
00000004
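
To relate the four bytes in the dumps above to the three characters a reader sees, here is a minimal sketch that decodes the bytes and counts how many bytes each character needs in UTF-8. The byte string is written with hex escapes so that the result does not depend on how this snippet itself is saved; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The four bytes from the hex dump, written as escapes so the example
# does not depend on the encoding of this source file.
my $bytes = "\x66\xc3\xbc\x72";

# Decode the UTF-8 bytes into a string of characters.
my $chars = decode('UTF-8', $bytes);

# For each character, report its code point and its UTF-8 byte count.
my $total = 0;
for my $ch (split //, $chars) {
    my $n = length(encode('UTF-8', $ch));
    $total += $n;
    printf "U+%04X takes %d byte(s)\n", ord($ch), $n;
}
printf "%d characters, %d bytes\n", length($chars), $total;
```

The "ü" (U+00FC) is the only character that needs two bytes, which accounts for the difference between 3 and 4.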

use utf8 as a Solution

It was suggested that I use utf8. I had actually tried that already, but it doesn't really work in the debugger, and I don't understand why:

> perl -de1

Loading DB routines from perl5db.pl version 1.39_10
Editor support available.

Enter h or 'h h' for help, or 'man perldebug' for more help.

main::(-e:1):   1
  DB<1> use utf8

  DB<2> print length "für"
4
  DB<3> q
> perl -Mutf8 -de1

Loading DB routines from perl5db.pl version 1.39_10
Editor support available.

Enter h or 'h h' for help, or 'man perldebug' for more help.

main::(-e:1):   1
  DB<1> print length "für"
4
  DB<2> q
> perl -Mutf8 -de 'print length "für", "\n"'

Loading DB routines from perl5db.pl version 1.39_10
Editor support available.

Enter h or 'h h' for help, or 'man perldebug' for more help.

main::(-e:1):   print length "für", "\n"
  DB<1> n
3
Debugged program terminated.  Use q to quit or R to restart,
use o inhibit_exit to avoid stopping after program termination,
h q, h R or h o to get additional info.
  DB<1> q
>

use Encode as a Solution

The solution https://stackoverflow.com/a/78821233/6607497 works, but I wonder why I need to tell Perl explicitly that the string is UTF-8:

  DB<1> use Encode qw(decode)

  DB<2> $x='für'

  DB<3> $y = decode('UTF-8', $x)

  DB<4> x length $x
0  4
  DB<5> x length $y
0  3
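
Outside the debugger, the same idea can be sketched as "decode on input, work with characters, encode on output". This is only an illustration under the assumption that the data arrives as raw UTF-8 bytes; the variable names are made up for the example, and the check flags are optional.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes, as they would arrive from an undecoded source.
my $raw = "f\xc3\xbcr";

# FB_CROAK makes decode() die on malformed input instead of silently
# substituting U+FFFD; LEAVE_SRC keeps $raw unmodified.
my $text = decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());

my $raw_len  = length $raw;     # counts bytes
my $text_len = length $text;    # counts characters

# substr now cuts between characters, not in the middle of one.
my $tail = substr($text, 1);

# Encode again before handing the result to something that expects bytes.
my $out = encode('UTF-8', $tail);

print "bytes: $raw_len, characters: $text_len\n";
print "tail as hex: ", unpack('H*', $out), "\n";
```

Compare this with the substr loop in the question, where the cut at offset 2 landed inside the two-byte sequence for "ü".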

Solution

  • If the UTF-8 string appears as a string literal within the source, you need to tell perl that the source file is UTF-8 encoded, with

    use utf8;
    

    If the string comes from somewhere else, such as a file or the command line, you need to tell perl that the file or the command line is UTF-8 encoded, for example:

    open my $fh, '<:encoding(UTF-8)', $filename;
    

    and

    perl -CSA script utf8-arg ...
    

    For example,

    $ printf '\x66\xc3\xbc\x72' |
    perl -Mv5.14 -CSA -nle'say length'
    3
    
    $ printf '\x66\xc3\xbc\x72' |
    perl -Mv5.14 -nle'use open ":std", ":encoding(UTF-8)"; say length'
    3
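
    The file-reading case can also be tried without creating a file. The following is a minimal sketch that uses an in-memory filehandle as a stand-in for a UTF-8 encoded file (an assumption made only to keep the example self-contained); the `:encoding(UTF-8)` layer is the same one shown in the `open` call above.

```perl
use strict;
use warnings;

# Stand-in for a UTF-8 encoded file: the bytes of "für" plus a newline.
my $file_bytes = "f\xc3\xbcr\n";

# The :encoding(UTF-8) layer decodes the bytes while reading.
open my $fh, '<:encoding(UTF-8)', \$file_bytes
    or die "Cannot open in-memory file: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Encode on output so the decoded string is written back as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
printf "%s has %d characters\n", $line, length($line);
```

    With the layer in place, `length` counts characters; without it, the same read would yield the four raw bytes.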