Read UTF-8 in Perl and output as ISO-8859-1


I have to read a text file in Perl that is encoded as UTF-8; this part works fine. My output file OUT_2 has to be encoded as ISO-8859-1 (aka "Latin1"). I tried this code (and some variations), but OUT_2 is always written as UTF-8. How can I get Latin-1 output?

use strict;
use Encode::Encoder;

open IN, "c:/Temp/Input.txt"; # this file is UTF-8

open OUT_1, ">", "c:/Temp/out_1.txt"; 
# encoding of OUT_1 does not matter because it contains only ASCII
open OUT_2, ">:encoding(latin1)", "c:/Temp/out_2.txt"; 

my $line = 1;
while ( <IN> ) {
    chomp;
    print OUT_1 "Write line $line\n";
    print OUT_2 "$_ and some stuff\n";
    $line++;
}

close IN;
close OUT_1;
close OUT_2;

This proposal does not work either:

 my $data = "$_ and some stuff\n";
 Encode::encode("latin1", Encode::decode("UTF-8", $data));
 print OUT_2 $data;

Solution

  • This seems to work correctly. See the description of Perl's open function: with an encoding layer on each handle, there is no need to explicitly transform the Perl string on the octet level using encode/decode. Further afield, possibly see the description of the open pragma and the binmode function.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    open my $in,  '<:encoding(UTF-8)',  'input-file-name'  or die $!;
    open my $out, '>:encoding(latin1)', 'output-file-name' or die $!;
    
    while (<$in>) {
      print $out $_;
    }
    

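    The only substantive difference from your code is that I am explicitly decoding the incoming data from UTF-8 bytes to characters. Without an input layer, Perl treats each byte it reads as a separate character, so the Latin-1 output layer writes those same bytes back out and the file still looks like UTF-8.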

    How are you checking the encodings of your input and output files? I used the file utility:

    $ file input-file-name output-file-name
    input-file-name: UTF-8 Unicode text
    output-file-name:  ISO-8859 text
    

    And also od -ch:

    $ od -ch input-file-name
    0000000   a   a   a 302 243 302 243 302 243   z   z   z  \n
               6161    c261    c2a3    c2a3    7aa3    7a7a    000a
    0000015
    
    $ od -ch output-file-name
    0000000   a   a   a 243 243 243   z   z   z  \n
               6161    a361    a3a3    7a7a    0a7a
    0000012
    

    (My file contained "aaa£££zzz".)
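
The effect of the missing input layer can also be reproduced in memory, without any files. The following is a minimal sketch, assuming the core Encode module; the sample string and the helper name hex_bytes are made up for illustration and are not part of the original code. It also shows the encode/decode route from the question done with the return value kept: the question's second attempt calls Encode::encode but never stores the result, so $data is printed unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the UTF-8 bytes of "a£z", as a raw read
# (no :encoding layer on the input handle) would deliver them.
my $raw = "a\xC2\xA3z";

# Illustrative helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# Without decoding, Perl sees four characters (a, U+00C2, U+00A3, z).
# Encoding them as Latin-1 simply reproduces the original UTF-8 bytes.
my $undecoded = encode('iso-8859-1', $raw);

# Decoding first yields three characters (a, U+00A3, z).
# Encoding those as Latin-1 gives one byte per character.
my $chars  = decode('UTF-8', $raw);
my $latin1 = encode('iso-8859-1', $chars);   # keep the return value

print hex_bytes($undecoded), "\n";   # 61 c2 a3 7a
print hex_bytes($latin1),    "\n";   # 61 a3 7a
```

In the file-based version above, the '<:encoding(UTF-8)' layer performs the decode step and the '>:encoding(latin1)' layer performs the encode step, which is why no explicit Encode calls are needed there.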