I have been given a file, (probably) encoded in Latin-1 (ISO 8859-1), and I need to do some conversions and data mining on it. The output is supposed to be in UTF-8. I have tried just about everything I could find about encoding conversion in Perl, but none of it produced usable output.
I know that use utf8; does nothing for this to begin with. I have tried the Encode package, which looked promising:
open FILE, '<', $ARGV[0] or die $!;
my %tmp = ();
my $last_num = 0;
while (<FILE>) {
    $_ = decode('ISO-8859-1', encode('UTF-8', $_));
    chomp;
    next unless length;
    process($_);
}
I tried that in every combination I could think of, and also threw in binmode(STDOUT, ":utf8"); and open FILE, '<:encoding(ISO-8859-1)', $ARGV[0] or die $!; and much more. The result was either scrambled umlauts, an error message like "\xC3 is not a valid UTF-8 character", or even mixed text (some in UTF-8, some in Latin-1).
All I want is a simple way to read in a Latin-1 text file and produce UTF-8 output on the console via print. Is there a simple way to do that in Perl?
See Perl encoding introduction and the Unicode cookbook.
Easiest with piconv:
$ piconv -f Latin1 -t UTF-8 < input.file > output.file
Easy, with encoding layers:
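use autodie qw(:all);
open my $input, '<:encoding(Latin1)', $ARGV[0];   # decode Latin-1 on read
binmode STDOUT, ':encoding(UTF-8)';               # encode UTF-8 on write
while (my $line = <$input>) {
    print process($line);
}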
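To make the layer approach concrete, here is a minimal self-contained sketch. It reads from an in-memory filehandle instead of $ARGV[0] and writes to an in-memory buffer instead of STDOUT, so it runs without an input file. The sub name latin1_to_utf8 and the sample bytes are assumptions made for illustration; they are not part of the question.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a string of Latin-1 bytes, returns UTF-8 bytes.
sub latin1_to_utf8 {
    my ($latin1_bytes) = @_;
    my $utf8_bytes = '';

    # The input layer decodes Latin-1 bytes into Perl characters on read.
    open my $in, '<:encoding(Latin1)', \$latin1_bytes or die $!;
    # The output layer encodes characters into UTF-8 bytes on write.
    open my $out, '>:encoding(UTF-8)', \$utf8_bytes or die $!;

    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $out or die $!;
    close $in  or die $!;
    return $utf8_bytes;
}

# "Gruesse" with real umlauts in Latin-1: u-umlaut is 0xFC, sharp s is 0xDF.
my $converted = latin1_to_utf8("Gr\xFC\xDFe\n");

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $converted), "\n";
# prints: 47 72 C3 BC C3 9F 65 0A
```

Each Latin-1 byte above 0x7F becomes a two-byte UTF-8 sequence, which is what a UTF-8 console expects. With the layers in place, the code between reading and printing only ever sees decoded characters.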
Moderately easy, with manual decoding and encoding:
use Encode qw(decode encode);
use autodie qw(:all);
open my $input, '<:raw', $ARGV[0];
binmode STDOUT, ':raw';
while (my $raw = <$input>) {
    my $line = decode 'Latin1', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $result = process($line);
    print {STDOUT} encode 'UTF-8', $result, Encode::FB_CROAK | Encode::LEAVE_SRC;
}
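The direction of the two calls matters: decode turns input bytes into characters, and encode turns characters into output bytes. The loop in the question applies them the other way round, which is a likely cause of the scrambled umlauts. A small sketch of the difference, using a single sample byte chosen for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "\xFC";   # one Latin-1 byte: u-umlaut

# Correct direction: bytes -> characters -> bytes.
my $chars = decode('ISO-8859-1', $raw);   # one character, U+00FC
my $utf8  = encode('UTF-8', $chars);      # two bytes, C3 BC

# Reversed order, as in the question: the raw byte is treated as a
# character and encoded to UTF-8, then those two UTF-8 bytes are each
# decoded as a separate Latin-1 character.
my $reversed = decode('ISO-8859-1', encode('UTF-8', $raw));

printf "correct:  %d character, %d UTF-8 bytes\n", length $chars, length $utf8;
printf "reversed: %d characters\n", length $reversed;
# prints:
# correct:  1 character, 2 UTF-8 bytes
# reversed: 2 characters
```

The reversed version ends up with the two characters U+00C3 and U+00BC where a single u-umlaut was intended, so encoding it again on output yields doubly encoded text.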