I am using HTML::TreeBuilder to extract the contents of a URL: I call $tree->look_down and then take the text of the elements it returns. My problem is that when I write that text to a file, it shows up as junk, and I have not been able to make any progress on this.
My sample code:
    use HTML::TreeBuilder;
    use HTML::Element;
    use utf8;

    $url = $ARGV[0];
    $page = `wget -qO - "$url" | tee data.txt`;
    #print "iam $page\n";

    my $tree = HTML::TreeBuilder->new();
    $tree->parse_file('data.txt');

    my @story = $tree->look_down(
        _tag  => 'div',
        class => 'storydescription'
    );
    my @title = $tree->look_down(
        _tag => 'title'
    );

    open(OUT, ">", "story.txt") or die "Cannot open story.txt: $!\n";
    binmode(OUT, ":utf8");
    foreach my $story (@story) {
        print OUT $story->as_text;
    }
    close(OUT);
I have tried binmode on the output filehandle, but it did not help. Only the non-ASCII (Unicode) text is garbled; plain ASCII characters are written to the file correctly.
It's documented in HTML::TreeBuilder:

When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.

One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. You can automate this process by using "html_file" in IO::HTML, which will use the HTML5 encoding sniffing algorithm to automatically determine the proper :encoding layer and apply it.
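
As a minimal sketch of the first suggestion, the file can be opened with an explicit :encoding layer and the filehandle handed to parse_file. The helper name story_text and the UTF-8 default are assumptions made for illustration; the question does not say which encoding the page uses, so adjust it to match. The file argument is the saved copy of the page (data.txt in the question).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse a saved page and return the text of every
# <div class="storydescription"> as a decoded character string.
sub story_text {
    my ($file, $encoding) = @_;
    $encoding ||= 'UTF-8';    # assumption: the page is UTF-8

    # Decode while reading, so the parser sees characters, not raw bytes.
    open my $in, "<:encoding($encoding)", $file
        or die "Cannot open $file: $!\n";

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($in);   # pass the filehandle, not the filename
    close $in;

    my @story = $tree->look_down(_tag => 'div', class => 'storydescription');
    my $text  = join '', map { $_->as_text } @story;
    $tree->delete;            # release the parse tree
    return $text;
}

# Usage: encode on the way out as well.
if (@ARGV) {
    open my $out, '>:encoding(UTF-8)', 'story.txt'
        or die "Cannot open story.txt: $!\n";
    print {$out} story_text($ARGV[0]);
    close $out;
}
```

If the encoding is not known in advance, replacing the open call with my $in = html_file($file); after use IO::HTML; should let IO::HTML choose the layer, as described in the quoted documentation.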