
What -C flag number in perl makes UTF-8 "just work"?


My setup: perl-5.20.2, UTF-8 environment.

Consider the following two bash examples. The first one works OK, the second doesn't.

echo -n 'привет мир' | perl -MEncode -le '$a=decode("utf8",<>); $x=decode("utf8","мир"); print encode("utf8",sprintf("% 11s",$a)) if $a=~/$x/'|grep -q ' привет мир' && echo OK
for (( i=0; $i < 512; i=$((i+1)) )); do echo -n 'привет мир' | perl -C$i -le '$a=<>; print sprintf("% 11s",$a) if $a=~/мир/' | grep -q ' привет мир' && echo $i; done

Why is there no -C flag number for the second example that makes it work at least once?


Solution

  • Why is there no -C flag number ... that makes the example work at least once?

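    Because UTF-8 literals in your Perl source are only treated as characters under use utf8;. Without it, the literal мир is compiled as six bytes rather than three characters. If -C decodes STDIN, the decoded input no longer matches that byte pattern; if it does not, the pattern matches but sprintf counts bytes, so the 19-byte input gets no padding. Either way the test fails.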

    for (( i=0; $i < 512; i=$((i+1)) )); do echo -n 'привет мир' | perl -C$i -le 'use utf8; $a=<>; print sprintf("% 11s",$a) if $a=~/мир/' | grep -q ' привет мир' && echo $i; done
    

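    No -C value replicates use utf8;, because -C only controls the encoding of I/O streams and @ARGV, not how the source code is parsed. With use utf8; in place, any odd value for -C passes the test (the lowest bit tells Perl that STDIN is UTF-8), but you get a "Wide character in print" warning unless STDOUT is also set to UTF-8.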

    So -C3 works without a warning, as does any value with $i % 4 == 3 (both the STDIN and STDOUT bits set). For one-liners, you probably want -CSDA (-C63) to say that all I/O and @ARGV should be UTF-8.

    You can also use the -Mutf8 option instead of putting use utf8; in your 1-liner. -mutf8 does not work because it is equivalent to use utf8 (); and the parens prevent the import method from being called. Since it's the import method that marks your source code as UTF-8, -mutf8 does nothing. But -Mutf8 is equivalent to use utf8; so it works.

    However, putting -Mutf8 into PERL5OPT may break any script that uses non-ASCII ISO-8859-1 literals. That may be a risk you're willing to take, but you should be aware of it.