Perl character encoding confusion


I have a multiline $string variable that contains UTF-8-encoded CSV data. I open this string as an in-memory file for processing and print its contents.

open(my $fh, "<", \$string) or die "Cannot open in-memory file: $!";
$/ = undef;    # slurp mode: read the whole handle at once
say <$fh>;

With hexdump I see the text is UTF-8 (É is c3 89).

Now I read the string through Text::CSV.

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
my $line;
$csv->say(\*STDOUT, $line) while ($line = $csv->getline($fh));

The É character has become c9 (Unicode?). If I print that to my console, I get � instead of É.

I use perl 5.28.0.

Why is Text::CSV altering the encoding, and how can I avoid it?

EDIT

I've made progress, thanks to @Gilles Quénot and @ikegami, and some trial and error.

What happened is that Text::CSV converted my strings into Perl's internal format (decoded character strings). Strings in that format are not output correctly to my UTF-8 terminal unless the program contains use open ':std', ':encoding(UTF-8)';. This directive is apparently needed only in the program's main file.
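
A minimal sketch of what that directive amounts to, using only core Perl: a decoding layer on the way in and an encoding layer on the way out. The sample data and variable names are made up for illustration, and an in-memory handle stands in for STDOUT so that the result can be inspected.

```perl
use strict;
use warnings;

# Hypothetical sample: the UTF-8 bytes of "café,Été" plus a newline.
my $string = "caf\xC3\xA9,\xC3\x89t\xC3\xA9\n";

# Decode on input: each line read is a string of characters, not bytes.
open(my $in, '<:encoding(UTF-8)', \$string) or die "Cannot open input: $!";
my $line = <$in>;
close $in;

# Encode on output. An in-memory handle stands in for STDOUT here;
# use open ':std', ':encoding(UTF-8)' puts the same kind of layer on STDOUT.
my $captured = '';
open(my $out, '>:encoding(UTF-8)', \$captured) or die "Cannot open output: $!";
print {$out} $line;
close $out;

printf "bytes in: %d, characters read: %d, bytes out: %d\n",
    length($string), length($line), length($captured);
```

The character count is smaller than the byte count because each accented letter is one character but two bytes; the bytes written match the bytes read.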

Another problem I had (absent from my example) was that I needed use utf8 in all source files so that string literals are converted into Perl's internal format. Without it, comparisons such as "É" eq $some_var fail because the literal stays as UTF-8 bytes (my editor saves files in that encoding) while the variable holds a decoded string.
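
The comparison problem can be reproduced in a few lines. This is a sketch that assumes the file is saved as UTF-8; $some_var is a stand-in built with the core Encode module, and $undecoded spells out the bytes the literal would consist of without use utf8.

```perl
use strict;
use warnings;
use utf8;                  # this file is saved as UTF-8; decode its literals
use Encode qw(decode);

# Stand-in for $some_var: data that has already been decoded from UTF-8.
my $some_var = decode('UTF-8', "\xC3\x89");

# With "use utf8" the literal is one character (U+00C9) and matches.
my $literal = "É";

# Without "use utf8" the same literal would be these two raw bytes.
my $undecoded = "\xC3\x89";

printf "decoded literal matches: %s\n", ($literal   eq $some_var) ? "yes" : "no";
printf "raw bytes match:         %s\n", ($undecoded eq $some_var) ? "yes" : "no";
```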

A third problem I encountered was stacked (double) encoding. Once use open ':std', ':encoding(UTF-8)'; is in place, any other encoding instruction applied to the same data must be removed from the program (the symptom I had: characters output as 4 bytes instead of 2).
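
The 4-byte symptom is what encoding an already encoded string produces. A small sketch with the core Encode module (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{C9}";                   # É as one decoded character
my $once  = encode('UTF-8', $char);     # correct: c3 89
my $twice = encode('UTF-8', $once);     # encoded again: c3 83 c2 89

# Print each result as space-separated hex bytes.
for my $pair ([once => $once], [twice => $twice]) {
    my ($label, $bytes) = @$pair;
    printf "%-5s %s\n", $label, join ' ', unpack '(H2)*', $bytes;
}
```

The second call treats the bytes c3 and 89 as two separate characters and encodes each of them, which is why the output doubles in size.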

EDIT 2

Here are simple tests that really helped me understand.

# no conversion to internal perl string format
$ perl -M'5.28.0' -e 'say "É"' | hexdump -C
00000000  c3 89 0a                                          |...|
00000003

# string literals converted to perl string format,
# but no conversion of output to terminal
# results in �
$ perl -Mutf8 -M'5.28.0' -e 'say "É"' | hexdump -C
00000000  c9 0a                                             |..|
00000002

# string literals converted to perl string format,
# AND conversion of output
$ perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -e 'say "É"' |hexdump -C
00000000  c3 89 0a                                          |...|
00000003

And finally

# entirely transparent because input is decoded 
# and reencoded on output
# use utf8 has no effect in this very basic example
$ echo É | perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -pne '' |hexdump -C
00000000  c3 89 0a                                          |...|
00000003

We have to assume strings are converted to perl internal format at some point.
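
That conversion can be made explicit with the core Encode module. This sketch does by hand what the input and output layers do in the tests above; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "\xC3\x89";                 # É as it sits in a UTF-8 file
my $chars = decode('UTF-8', $bytes);    # input side: bytes -> characters
my $again = encode('UTF-8', $chars);    # output side: characters -> bytes

printf "raw input:  %d element(s), first is 0x%02x\n", length($bytes), ord($bytes);
printf "decoded:    %d element(s), first is 0x%02x\n", length($chars), ord($chars);
printf "re-encoded: %s\n", join ' ', unpack '(H2)*', $again;
```

The decoded string holds the single code point c9, which is the value that showed up in the hexdump when no output encoding was applied.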


Solution

  • Try adding these lines after the shebang:

    # Tell Perl your code is encoded using UTF-8.
    use utf8;
    
    # Tell Perl that input and output are encoded using UTF-8.
    use open ':std', ':encoding(UTF-8)';
    

    See
    https://stackoverflow.com/a/15147306/465183
    https://perldoc.perl.org/feature#The-'unicode_strings'-feature
    Why does modern Perl avoid UTF-8 by default?