
Polish characters in Perl [HTML::TreeBuilder and utf8 input files]


I have been working on this for a week and still cannot find a working solution. I parse an HTML file that contains Polish letters encoded in UTF-8. After extracting the information I am interested in, I save it to a file or print it to the console, but none of the Polish characters are displayed properly.

I have tried everything I found on Stack Overflow and other forums, but the things that work for other people don't work for me. I used:

use open qw(:std :utf8);
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );
#and many others;

Here is my Perl code:

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
use File::Find;
use Encode;


my $location="C:\\MyLocation";
open (MYFILE, '>>data.txt');

sub find_txt {
    my $F = $File::Find::name;

    if ($F =~ /index.html$/ ) {
        my $tr = HTML::TreeBuilder->new->parse_file('index.html');

        for my $div ($tr->look_down(_tag => 'h2', 'class' => 'featured')) {
            say $div->as_text;
            print (MYFILE $div->as_text);
        }

        for my $div ($tr->look_down(_tag => 'div', 'class' => 'post-content')) {
            for my $t ($div->look_down(_tag => 'p')) {
                say $t->as_text;
                print (MYFILE $t->as_text);
            }
        }

        for my $div ($tr->look_down(_tag => 'h4', 'class' => 'related-posts')) {
            for my $t ($div->look_down(_tag => 'a')) {
                say $t->as_text;
                print (MYFILE $t->as_text);
            }
        }
    }
}

find(\&find_txt, $location);
close (MYFILE);

And here is the part of the HTML file that causes problems:

<div class="post-content">
  <p>(łac. abacus)</p>
  <p>1. płyta będąca najwyższą częścią kolumny</p>
  <p>2. w starożytności &#8211; deska do liczenia, pierwowzór liczydła</p>

I am not sure whether the Polish characters will display in your browser, but they are the characters with these Unicode code points (hexadecimal): 104, 106, 118, 141, 143, D3, 15A, 179, 17B, 105, 107, 119, 142, 144, F3, 15B, 17A, 17C


Solution

  • HTML::TreeBuilder parse_file - charset autodetection

    You can open the file explicitly with the right encoding and pass the filehandle to parse_file:

    ...
    open (my $MYFILE, '<:encoding(UTF-8)', 'index.html') or die "Cannot open index.html: $!"; # open for reading, decoding the bytes as UTF-8
    ...
    my $tr = HTML::TreeBuilder->new->parse_file($MYFILE);
    ...
    

    Or use IO::HTML, which detects the charset automatically when it opens the file.

    ...
    use IO::HTML;                 # exports html_file by default
    ...
    my $tr = HTML::TreeBuilder->new->parse_file(html_file('index.html'));
    ....
    

    From man HTML::TreeBuilder:

    parse_file
       ....
       When you pass a filename to "parse_file", HTML::Parser opens it in binary mode,
       which means it's interpreted as Latin-1 (ISO-8859-1).  If the file
       is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.
    ....
    SEE ALSO
       ....
       For opening a HTML file with automatic charset detection: IO::HTML.
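
To see why binary mode garbles the text, here is a minimal sketch that uses only the core Encode module. It does not call HTML::TreeBuilder at all; it just imitates the two ways the same bytes can be read. The sample string and the variable names ($text, $bytes, $as_latin1, $as_utf8) are mine, chosen for illustration, and the Polish letters are written as \x{...} escapes so the script itself stays plain ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "płyta będąca" written with escapes: ł = U+0142, ę = U+0119, ą = U+0105
my $text  = "p\x{142}yta b\x{119}d\x{105}ca";   # character string
my $bytes = encode('UTF-8', $text);             # the bytes as stored in the file

# Binary mode: every byte is taken as one Latin-1 character,
# so each two-byte Polish letter turns into two wrong characters.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# UTF-8 decoding (what an :encoding(UTF-8) layer does on read):
# the bytes become the original characters again.
my $as_utf8 = decode('UTF-8', $bytes);

printf "original characters:    %d\n", length $text;
printf "read as Latin-1:        %d\n", length $as_latin1;
printf "read as UTF-8:          %d\n", length $as_utf8;
printf "UTF-8 read matches:     %s\n", $as_utf8 eq $text ? 'yes' : 'no';
```

The Latin-1 reading comes out three characters longer than the original, one extra for each of ł, ę and ą. Printing that string through a UTF-8 output layer encodes the already-wrong characters a second time, which is the usual source of the mangled output described in the question.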