Perl - Encoding error when working with .html file


I have some .html files in a directory to which I want to add one line of CSS. Using Perl, I can locate the position with a regex and add the CSS; this works very well.

However, my first .html file contains an accented letter, é, and the resulting .html file has an encoding problem: it prints \xE9 instead.

In the Perl script, I have been careful to specify UTF-8 encoding when opening the input and output files, as shown in the MWE below, but that does not solve the problem. How can I fix this encoding error?

MWE

use strict;
use warnings;
use File::Spec::Functions qw/ splitdir rel2abs /; # To get the current directory name

# Define variables
my ($inputfile, $outputfile, $dir);

# Initialize variables
$dir = '.';

# Open current directory
opendir(DIR, $dir) or die "Cannot open directory $dir: $!";

# Scan all files in directory
while (my $inputfile = readdir(DIR)) {
    
    #Name output file based on input file
    $outputfile = $inputfile;
    $outputfile =~ s/_not_centered//;
    
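    # Process only regular files whose names end in _not_centered.html
    next unless (-f "$dir/$inputfile");
    next unless ($inputfile =~ m/_not_centered\.html$/);

    # Open output file (after the checks above, so that unrelated files are never truncated)
    open(my $ofh, '>:encoding(UTF-8)', $outputfile) or die "Cannot open $outputfile: $!";

    # Open input file
    open(my $ifh, '<:encoding(UTF-8)', $inputfile) or die "Cannot open $inputfile: $!";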
    
    # Read input file
    while(<$ifh>) {
        # Add the centering style to each <h2> opening tag; all other lines pass through unchanged
        s/<h2/<h2 style="text-align: center;"/;
        print $ofh $_;
    }
    
    # Close input and output files
    close $ifh;
    close $ofh;
}

# Close directory
closedir(DIR);

Problematic file named "Chapter_001_not_centered.html"

<html > 
<head></head>
<body>
                                                           
<h2 class="chapterHead"><span class="titlemark">Chapter&#x00A0;1</span><br /><a id="x1-10001"></a>Brocéliande</h2>
Brocéliande

</body></html>

Solution

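  • This question was answered in the comments by @Shawn and @sticky bit:

    The input file is evidently encoded in ISO 8859-1 rather than UTF-8, so é is stored as the single byte 0xE9, which is not a valid UTF-8 sequence; the UTF-8 decoding layer therefore emits the literal escape \xE9. Opening the files with the ISO 8859-1 encoding instead of UTF-8 solves the problem. If one of you wants to post this as an answer, I will accept it.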
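
For reference, the fix from the comments can be sketched roughly as follows. This is only a minimal sketch, assuming the input files really are stored as ISO 8859-1. The helper name center_headings and its optional output-encoding argument are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): copy $in_path to
# $out_path, adding a centering style to every <h2> opening tag.
# Assumption: the input file is encoded as ISO-8859-1.
sub center_headings {
    my ($in_path, $out_path, $out_encoding) = @_;

    # Default to ISO-8859-1 on output too, as in the fix from the comments.
    $out_encoding = 'ISO-8859-1' unless defined $out_encoding;

    # Decode the input bytes as ISO-8859-1, so the byte 0xE9 becomes the character é.
    open(my $ifh, '<:encoding(ISO-8859-1)', $in_path)
        or die "Cannot open $in_path: $!";

    # Encode the output with the requested encoding.
    open(my $ofh, ">:encoding($out_encoding)", $out_path)
        or die "Cannot open $out_path: $!";

    while (my $line = <$ifh>) {
        # Lines without an <h2> tag are written unchanged.
        $line =~ s/<h2/<h2 style="text-align: center;"/;
        print {$ofh} $line;
    }

    close $ifh;
    close $ofh or die "Cannot close $out_path: $!";
    return;
}
```

Calling center_headings('Chapter_001_not_centered.html', 'Chapter_001.html') keeps the file in ISO 8859-1. Passing 'UTF-8' as a third argument would convert the output to UTF-8 instead, in which case the page should also declare that charset so browsers decode it correctly.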