I have a multiline $string variable that contains UTF-8 CSV. I open this string as a file for processing and print its contents.
open(my $fh, "<", \$string);
$/=undef;
say <$fh>;
With hexdump I see the text is UTF-8 (É is c3 89).
Now I read the string through Text::CSV.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
my $line;
$csv->say(\*STDOUT, $line) while ($line = $csv->getline($fh));
The É char has become c9 (Unicode?). If I print that to my console I get � instead of É.
I use perl 5.28.0.
Why is Text::CSV altering the encoding, and how can I avoid it?
EDIT
I've made progress, thanks to @Gilles Quénot and @ikegami, and some trial and error.
What happened is that Text::CSV converted my strings into perl's internal format (decoded characters rather than UTF-8 bytes). Strings in that format are not output correctly to my UTF-8 terminal unless I add use open ':std', ':encoding(UTF-8)'; to the program. This directive is apparently needed in my program's main file only.
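The difference between UTF-8 bytes and decoded characters can be made visible with the core Encode module. This is a minimal sketch, not part of the original program; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "É", as they sit in the undecoded $string.
my $bytes = "\xC3\x89";

# Decoding turns the two bytes into a single character, U+00C9.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length=%d\n", length $bytes;                            # bytes: length=2
printf "chars: length=%d, U+%04X\n", length $chars, ord $chars;        # chars: length=1, U+00C9

# Encoding on the way out restores the original two bytes.
my $out = encode('UTF-8', $chars);
printf "re-encoded: %s\n", join ' ', unpack('(H2)*', $out);            # re-encoded: c3 89
```

Printing $chars to a handle that has no encoding layer emits the single byte c9, which a UTF-8 terminal shows as �.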
Another problem I had (absent from my example) was that I needed use utf8 in all source files to convert my program's literals into perl's internal format. Without it, comparisons such as "É" eq $some_var fail, because the former will be UTF-8 bytes (my editor saves files in that encoding) and the latter will be in perl's internal format.
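The effect of use utf8 on such a comparison can be reproduced in a single file, since the pragma is lexically scoped. In this sketch $some_var is a stand-in for a decoded value (for example a field returned by Text::CSV), and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                   # literals below are decoded from UTF-8
use Encode qw(decode);

# Stand-in for a value that arrived already decoded.
my $some_var = decode('UTF-8', "\xC3\x89");

my $with_utf8 = "É" eq $some_var ? 'equal' : 'different';

my $without_utf8;
{
    no utf8;                # inside this block the literal is the raw bytes c3 89
    $without_utf8 = "É" eq $some_var ? 'equal' : 'different';
}

print "with use utf8:    $with_utf8\n";       # equal
print "without use utf8: $without_utf8\n";    # different
```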
Another problem I encountered was stacked encoding. Once use open ':std', ':encoding(UTF-8)'; is in place, any other encoding instruction applied to the same data or handles must be removed from the program (the symptom I had: chars output as 4 bytes instead of 2).
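The 4-byte symptom is what encoding the same character twice produces. A minimal sketch with the core Encode module (not the original program, where the second encoding presumably came from an extra layer or explicit encode call):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{C9}";                    # one decoded character, É
my $once  = encode('UTF-8', $char);      # correct: two bytes
my $twice = encode('UTF-8', $once);      # each byte is treated as a character and encoded again

printf "once:  %s\n", join ' ', unpack('(H2)*', $once);    # once:  c3 89
printf "twice: %s\n", join ' ', unpack('(H2)*', $twice);   # twice: c3 83 c2 89
```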
EDIT 2
Here are simple tests that really helped me understand.
# no conversion to internal perl string format
$ perl -M'5.28.0' -e 'say "É"' | hexdump -C
00000000 c3 89 0a |...|
00000003
# string literals converted to perl string format,
# but no conversion of output to terminal
# results in �
$ perl -Mutf8 -M'5.28.0' -e 'say "É"' | hexdump -C
00000000 c9 0a |..|
00000002
# string literals converted to perl string format,
# AND conversion of output
$ perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -e 'say "É"' |hexdump -C
00000000 c3 89 0a |...|
00000003
And finally
# entirely transparent because input is decoded
# and reencoded on output
# use utf8 has no effect in this very basic example
$ echo É | perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -pne '' |hexdump -C
00000000 c3 89 0a |...|
00000003
We have to assume strings are converted to perl internal format at some point.
Try adding these lines after the shebang:
# Tell Perl your code is encoded using UTF-8.
use utf8;
# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';
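The same decode-on-input, encode-on-output principle can also be stated explicitly on each handle. The following is only a sketch under assumptions: the sample data is hypothetical, and the output goes to an in-memory buffer rather than STDOUT so that the result does not depend on terminal settings.

```perl
use strict;
use warnings;
use Text::CSV;

# Hypothetical sample data: UTF-8 encoded bytes, as in the question.
my $string = "id,name\n1,\xC3\x89ric\n";

# Decode on input: the layer turns UTF-8 bytes into characters.
open(my $in, '<:encoding(UTF-8)', \$string) or die "open input: $!";

# Encode on output: characters are written back out as UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', \my $buffer) or die "open output: $!";

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
while (my $row = $csv->getline($in)) {
    $csv->say($out, $row);
}
close $out;

# $buffer now holds UTF-8 bytes again; É is c3 89, not c9.
print unpack('H*', $buffer), "\n";
```

Because the layers are given in the open mode, they take precedence over any defaults set by the open pragma, so this form should not stack a second encoding on the same handle.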
See
https://stackoverflow.com/a/15147306/465183
https://perldoc.perl.org/feature#The-'unicode_strings'-feature
Why does modern Perl avoid UTF-8 by default?