Tags: perl, utf-8, character-encoding, latin1

Proper handling of UTF-8 in Perl


I have been given a file that is (probably) encoded in Latin-1 (ISO 8859-1), and I need to do some conversions and data mining on it. The output is supposed to be in UTF-8. I have tried just about everything I could find on encoding conversion in Perl, but none of it produced usable output.

I know that use utf8; does nothing for this to begin with, since it only concerns the encoding of the source code itself. I have tried the Encode package, which looked promising:

open FILE, '<', $ARGV[0] or die $!;

my %tmp = ();
my $last_num = 0;

while (<FILE>) {
    $_ = decode('ISO-8859-1', encode('UTF-8', $_));

    chomp;
    next unless length;
    process($_);
}

I tried that in every combination I could think of, and also threw in binmode(STDOUT, ":utf8"); and open FILE, '<:encoding(ISO-8859-1)', $ARGV[0] or die $!; among much else. The results were either scrambled umlauts, an error message like \xC3 is not a valid UTF-8 character, or even mixed text (some in UTF-8, some in Latin-1).

All I want is a simple way to read a Latin-1 text file and produce UTF-8 output on the console via print. Is there a simple way to do that in Perl?


Solution

  • See Perl encoding introduction and the Unicode cookbook.

    • Easiest with piconv:

      $ piconv -f Latin1 -t UTF-8 < input.file > output.file
      
    • Easy, with encoding layers:

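      use autodie qw(:all);
      open my $input, '<:encoding(Latin1)', $ARGV[0];   # decode Latin-1 while reading
      binmode STDOUT, ':encoding(UTF-8)';               # encode UTF-8 while writing
      print while <$input>;                             # lines are character strings here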
      
    • Moderately easy, with manual decoding and encoding:

      use Encode qw(decode encode);
      use autodie qw(:all);
      
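      open my $input, '<:raw', $ARGV[0];   # read undecoded bytes
      binmode STDOUT, ':raw';              # write bytes exactly as encoded below
      while (my $raw = <$input>) {
          # Bytes -> characters; FB_CROAK dies on malformed input, LEAVE_SRC keeps $raw intact.
          my $line = decode 'Latin1', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC;
          my $result = process($line);   # process() is the asker's own routine
          # Characters -> UTF-8 bytes for output.
          print {STDOUT} encode 'UTF-8', $result, Encode::FB_CROAK | Encode::LEAVE_SRC;
      }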
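
To check either approach without a real input file, the layers can be tried on in-memory filehandles. The following is a minimal sketch, not part of the original answer: the sample string, the variable names, and the in-memory handles standing in for $ARGV[0] and STDOUT are all assumptions made for illustration. It also runs the call order from the question for contrast. Decoding has to come first (bytes to characters), and encoding second (characters to bytes).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "Grüße\n" as Latin-1 bytes (ü = 0xFC, ß = 0xDF).
my $latin1_bytes = "Gr\xFC\xDFe\n";

# Read through a decoding layer; the in-memory file stands in for $ARGV[0].
open my $input, '<:encoding(Latin1)', \$latin1_bytes or die $!;

# Write through an encoding layer; the scalar stands in for STDOUT.
my $utf8_bytes = '';
open my $output, '>:encoding(UTF-8)', \$utf8_bytes or die $!;

while (my $line = <$input>) {
    # $line is a character string here, so length() counts characters.
    print {$output} $line;
}
close $output;
close $input;

# Manual equivalent of the two layers: decode first, then encode.
my $manual = encode('UTF-8', decode('Latin1', $latin1_bytes));

# The order used in the question, for contrast: encode first, then decode.
my $swapped = decode('ISO-8859-1', encode('UTF-8', $latin1_bytes));

# Hex dump of the UTF-8 output: ü becomes C3 BC, ß becomes C3 9F.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_bytes), "\n";
# prints: 47 72 C3 BC C3 9F 65 0A

# Correctly decoded text has 6 characters; the swapped order yields 8.
printf "%d vs %d characters\n",
    length(decode('Latin1', $latin1_bytes)), length($swapped);
# prints: 6 vs 8 characters
```

In the swapped order, each byte of the UTF-8 encoding ends up as a separate character, so a later output layer such as :utf8 would encode the text a second time. That is one plausible source of the scrambled umlauts described in the question.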