Why is Perl HTTP::Response not decoding this apostrophe?


I'm using

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 6.1; Intel Mac OS X 10.6; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ");
my $url = "http://somedomain.com/page/";
my $req = HTTP::Request->new(GET => $url);
$req->header('Accept' => 'text/html');
my $response = $ua->request($req);
my $html = $response->decoded_content;

to get a web page. On this page, Abobo's Big Adventure appears. In $response->content and $response->decoded_content, the apostrophe is shown as an HTML character entity instead (e.g. Abobo&#39;s Big Adventure).

Is there something I can do to make this decode correctly?


Solution

  • Why, that is completely valid HTML! However, you can decode the entities using HTML::Entities from CPAN.

    use HTML::Entities;
    
    ...;
    my $html = $response->decoded_content;
    my $decoded_string = decode_entities($html);
    

    The docs for decoded_content (a method HTTP::Response inherits from HTTP::Message) state that only the Content-Encoding and the charset are reversed, not HTML entities (which are an HTML/XML language feature, not really an encoding).
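
    To see the distinction without touching the network, here is a minimal sketch that builds an HTTP::Response by hand. The body and the &#39; entity are assumptions standing in for the real page, whose exact markup is not shown in the question.

    ```perl
    use strict;
    use warnings;
    use HTTP::Response;
    use HTML::Entities qw(decode_entities);

    # Hand-built response (no network). The body is a made-up stand-in
    # for the page in the question; the &#39; entity is an assumption.
    my $response = HTTP::Response->new(200, 'OK');
    $response->header('Content-Type' => 'text/html; charset=UTF-8');
    $response->content('<title>Abobo&#39;s Big Adventure</title>');

    # decoded_content reverses Content-Encoding and charset only,
    # so the entity is still present afterwards.
    my $html = $response->decoded_content;
    print "$html\n";       # <title>Abobo&#39;s Big Adventure</title>

    # decode_entities replaces the entity with the literal character.
    my $decoded = decode_entities($html);
    print "$decoded\n";    # <title>Abobo's Big Adventure</title>
    ```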

    Edit:

    However, as ikegami pointed out, decoding the entities immediately could render the HTML unparsable. Therefore, it is best to parse the HTML first (e.g. using HTML::Tree) and extract the text you need from the parsed tree. Note that HTML::TreeBuilder decodes entities in text nodes while parsing by default (see its no_expand_entities option), so a further decode_entities call is unnecessary and would double-decode text such as &amp;lt;.

    use HTML::TreeBuilder;
    
    my $url = ...;
    my $tree = HTML::TreeBuilder->new_from_url($url);  # invokes LWP automatically
    my $decoded_text = $tree->as_text;                 # flat text; entities were already decoded during parsing
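
    The caveat about decoding before parsing can be illustrated with a small sketch. The fragment below is hypothetical: its text mentions tags, so the angle brackets are escaped. Running decode_entities over the whole document turns that escaped text into real markup, which a parser would then treat as elements. The tag count uses a deliberately crude regex and is only for illustration.

    ```perl
    use strict;
    use warnings;
    use HTML::Entities qw(decode_entities);

    # Hypothetical fragment: the text *talks about* tags, so the angle
    # brackets are escaped and the paragraph has no child elements.
    my $html = '<p>Use &lt;b&gt; for bold &amp; &lt;i&gt; for italics</p>';

    # Decoding the whole document turns the escaped text into real markup.
    my $too_early = decode_entities($html);
    print "$too_early\n";   # <p>Use <b> for bold & <i> for italics</p>

    # Count opening-tag-like sequences with a crude regex (illustration only).
    my $before = () = $html      =~ /<[a-z]+>/g;
    my $after  = () = $too_early =~ /<[a-z]+>/g;
    print "opening tags before: $before, after: $after\n";   # before: 1, after: 3
    ```

    Parsing first and reading text from the tree avoids this, because the parser sees the original escaping and knows which characters are text.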