While implementing a line-wrapping algorithm, I realized that Perl's length
function returns unexpected results when the input contains umlauts. For example, the three-character string "für" is reported as having a length of four:
> perl -e 'print length "für" == 4, "\n"'
1
In detail (in Perl debugger):
DB<7> x length "für"
0 4
DB<8> foreach (0..3) { print substr('für', $_), "\n" }
für
ür
▒r
r
DB<9> q
The obvious question is: How can I make Perl interpret the string as a UTF-8 string?
The manual suggests it should do so automatically.
"perluniintro(1)" states in "Perl's Unicode Model":
Perl supports both pre-5.6 strings of eight-bit native bytes, and strings of Unicode characters. The general principle is that Perl tries to keep its data as eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be avoided, the data is transparently upgraded to Unicode. Prior to Perl v5.14.0, the upgrade was not completely transparent (see "The "Unicode Bug"" in perlunicode), and for backwards compatibility, full transparency is not gained unless "use feature 'unicode_strings'" (see feature) or "use 5.012" (or higher) is selected.
The code where I found the issue explicitly uses "require 5.018_000;", so shouldn't Perl "do the right thing" automatically?
(I found an older question of mine, "printf aligning problem with degree (°) character", that describes a similar problem, but it concerns printing strings, and I don't see how its answers apply here.)
As I was asked for the bytes in the string, here is what I got:
> perl -e "print 'für'" | od -x
0000000 c366 72bc
0000004
> perl -e "print unpack('H*', 'für');" -e 'print "\n"'
66c3bc72
> perl -e "print 'für'" | hexdump -C
00000000 66 c3 bc 72 |f..r|
00000004
use utf8 as a Solution
It has been suggested to "use utf8", and I had actually tried that already, but it doesn't really work, and I don't understand why:
> perl -de1
Loading DB routines from perl5db.pl version 1.39_10
Editor support available.
Enter h or 'h h' for help, or 'man perldebug' for more help.
main::(-e:1): 1
DB<1> use utf8
DB<2> print length "für"
4
DB<3> q
> perl -Mutf8 -de1
Loading DB routines from perl5db.pl version 1.39_10
Editor support available.
Enter h or 'h h' for help, or 'man perldebug' for more help.
main::(-e:1): 1
DB<1> print length "für"
4
DB<2> q
> perl -Mutf8 -de 'print length "für", "\n"'
Loading DB routines from perl5db.pl version 1.39_10
Editor support available.
Enter h or 'h h' for help, or 'man perldebug' for more help.
main::(-e:1): print length "für", "\n"
DB<1> n
3
Debugged program terminated. Use q to quit or R to restart,
use o inhibit_exit to avoid stopping after program termination,
h q, h R or h o to get additional info.
DB<1> q
>
use Encode as a Solution
The solution https://stackoverflow.com/a/78821233/6607497 works, but I wonder why I need to explain to Perl that the string is UTF-8:
DB<1> use Encode qw(decode)
DB<2> $x='für'
DB<3> $y = decode('UTF-8', $x)
DB<4> x length $x
0 4
DB<5> x length $y
0 3
If the UTF-8 string appears as a string literal within the source, then you need to tell perl that the source file is UTF-8-encoded, with
use utf8;
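As a minimal sketch of the difference, the script below evaluates the same literal once without and once with the pragma, relying on "use utf8" being lexically scoped. It assumes the script file itself is saved as UTF-8 with a precomposed ü; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8, so "ü" is stored as the bytes c3 bc.

# Without the pragma, perl reads the literal as four raw bytes.
my $byte_length = length do { no utf8; "für" };

# With the pragma in scope, perl decodes the literal into three characters.
my $char_length = length do { use utf8; "für" };

print "without use utf8: $byte_length\n";
print "with use utf8: $char_length\n";
```

Under those assumptions the first count should be 4 and the second 3. The pragma only affects how the source text is parsed; it does not change how data is read from or written to file handles.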
If the string comes from another source, such as from a file or the command line, then you need to tell perl that the file or command line is UTF-8-encoded, for example:
open my $fh, '<:encoding(UTF-8)', $filename;
and
perl -CSA script utf8-arg ...
For example,
$ printf '\x66\xc3\xbc\x72' |
perl -Mv5.14 -CSA -nle'say length'
3
$ printf '\x66\xc3\xbc\x72' |
perl -Mv5.14 -nle'use open ":std", ":encoding(UTF-8)"; say length'
3
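The file-handle variant can be sketched in the same way. The script below writes the four bytes from the question to a temporary file and reads them back through two different PerlIO layers. The helper length_via_layer and the use of File::Temp are assumptions made for this illustration, not part of the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the four raw bytes of "für" (66 c3 bc 72) to a temporary file.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\x66\xc3\xbc\x72";
close $out or die "Cannot close $filename: $!";

# Hypothetical helper: read the whole file through the given PerlIO layer
# and return the length of what was read.
sub length_via_layer {
    my ($layer) = @_;
    open my $fh, "<$layer", $filename or die "Cannot open $filename: $!";
    local $/;               # slurp mode: read everything at once
    my $text = <$fh>;
    close $fh;
    return length $text;
}

my $raw_length     = length_via_layer(':raw');              # no decoding: bytes
my $decoded_length = length_via_layer(':encoding(UTF-8)');  # decoded: characters

print "raw: $raw_length, decoded: $decoded_length\n";
```

Reading through ":raw" yields 4 because the data stays undecoded bytes, while ":encoding(UTF-8)" yields 3 because the layer decodes the bytes into characters before length sees them.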