I'm using
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 6.1; Intel Mac OS X 10.6; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ");
my $url = "http://somedomain.com/page/";
my $req = HTTP::Request->new(GET => $url);
$req->header('Accept' => 'text/html');
my $response = $ua->request($req);
my $html = $response->decoded_content;
to get a web page. On this page, the title Abobo's Big Adventure appears. In both $response->content and $response->decoded_content, however, the apostrophe comes back as an HTML character entity rather than as a literal apostrophe.
Is there something I can do to make this decode correctly?
That is completely valid HTML. However, you can decode the entities using HTML::Entities from CPAN.
use HTML::Entities;
...;
my $html = $response->decoded_content;
my $decoded_string = decode_entities($html);
The docs for HTTP::Response::decoded_content state that it undoes the Content-Encoding and the charset, but not HTML entities (which are an HTML/XML language feature, not really an encoding).
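To make the distinction concrete, here is a minimal sketch that builds a response by hand instead of fetching one. The sample markup, the UTF-8 byte string, and the variable names are made up for illustration. It assumes HTTP::Response and HTML::Entities are installed.

```perl
use strict;
use warnings;
use HTTP::Response;
use HTML::Entities qw(decode_entities);

# Hypothetical response body: UTF-8 bytes for "café" plus two HTML entities.
my $bytes = "<p>caf\xC3\xA9 &amp; Abobo&#39;s</p>";
my $res   = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

# decoded_content turns the UTF-8 bytes into characters,
# but leaves &amp; and &#39; exactly as they were.
my $page = $res->decoded_content;

# decode_entities is the separate step that resolves the entities.
my $plain = decode_entities($page);

binmode STDOUT, ':encoding(UTF-8)';
print "$page\n$plain\n";
```

Note that the second step is only safe here because the sample contains no escaped markup.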
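However, as ikegami pointed out, decoding the entities immediately could render the HTML unparsable, because escaped characters such as &lt; would turn into literal markup. Therefore, it is best to parse the HTML first (e.g. using HTML::Tree) and work with the text nodes afterwards. By default, HTML::TreeBuilder already expands entities in text while parsing, so a second call to decode_entities is not needed.
use HTML::TreeBuilder;
my $url = ...;
my $tree = HTML::TreeBuilder->new_from_url($url); # invokes LWP automatically
# Entities in text nodes are expanded during parsing unless no_expand_entities is set,
# so calling decode_entities on this text again could double-decode it.
my $decoded_text = $tree->as_text; # dumps the tree as flat, already decoded text
$tree->delete; # free the tree when done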
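To illustrate why the order matters, the following sketch compares decoding a whole document up front with parsing it first. The sample string and variable names are made up for illustration. It assumes HTML::Entities and HTML::TreeBuilder (from HTML::Tree) are installed, and relies on TreeBuilder's default of expanding entities in text while parsing.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use HTML::TreeBuilder;

# Hypothetical page fragment: the text mentions a tag by escaping it.
my $source = '<p>Use &lt;b&gt; for bold text &amp; Abobo&#39;s Big Adventure</p>';

# Decoding first: the escaped &lt;b&gt; becomes a literal, unclosed <b> tag,
# so a parser run afterwards would see markup the author never wrote.
my $too_early = decode_entities($source);

# Parsing first: the tree keeps "<b>" as plain text inside the paragraph.
my $tree = HTML::TreeBuilder->new_from_content($source);
my $text = $tree->as_text;
$tree->delete;

print "decoded before parsing: $too_early\n";
print "text after parsing:     $text\n";
```

In the first result the string is no longer the same document; in the second, the apostrophe and ampersand are readable and the structure was never at risk.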