WWW::Mechanize ignores base href on gzipped content


As the title says, WWW::Mechanize does not recognize

<base href="" />

if the page content is gzipped. Here is an example:

use strict;
use warnings;
use WWW::Mechanize;

my $url = 'http://objectmix.com/perl/356181-help-lwp-log-after-redirect.html';

my $mech = WWW::Mechanize->new;
$mech->get($url);
print $mech->base()."\n";

# force plain text instead of gzipped content
$mech->get($url, 'Accept-Encoding' => 'identity');
print $mech->base()."\n";

Output:

http://objectmix.com/perl/356181-help-lwp-log-after-redirect.html
http://objectmix.com/    <--- this is correct !

Am I missing something here? Thanks

Edit: I just tested it directly with LWP::UserAgent and it works without any problems:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $res = $ua->get('http://objectmix.com/perl/356181-help-lwp-log-after-redirect.html');
print $res->base()."\n";

Output:

http://objectmix.com/ 

This looks like a WWW::Mechanize bug?

Edit 2: It is an LWP or HTTP::Response bug, not a WWW::Mechanize one. LWP does not request gzip by default. If I set

$ua->default_header('Accept-Encoding' => 'gzip');

in the above example, it returns the wrong base.

Edit 3: The bug is in parse_head() in LWP/UserAgent.pm.

It calls HTML::HeadParser with the gzipped HTML, and HeadParser has no idea what to do with it. LWP should gunzip the content before calling the parsing subroutine.
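
The failure described in Edit 3 can probably be reproduced offline, without LWP at all. The sketch below is only an illustration: the page content, the example.com URL and the helper name base_href() are made up, and it feeds HTML::HeadParser the same markup once as gzipped bytes and once after gunzipping. The parser should only be able to find the base in the second case.

```perl
use strict;
use warnings;
use HTML::HeadParser;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Made-up page standing in for the real response body.
my $html = '<html><head><base href="http://example.com/">'
         . '<title>demo</title></head><body><p>demo</p></body></html>';

# Compress it the way a server would for Accept-Encoding: gzip.
gzip(\$html => \my $gzipped) or die "gzip failed: $GzipError";

# What LWP would need to do before parsing the head.
gunzip(\$gzipped => \my $plain) or die "gunzip failed: $GunzipError";

# Hypothetical helper: return the <base href> that HTML::HeadParser
# finds in $content, or undef if it finds none.
sub base_href {
    my ($content) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($content);
    return $parser->header('Content-Base');
}

printf "gzipped:   %s\n", base_href($gzipped) // '(no base found)';
printf "gunzipped: %s\n", base_href($plain)   // '(no base found)';
```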


Solution

  • There is a bug report about this: https://rt.cpan.org/Public/Bug/Display.html?id=54361

    Conclusion: LWP is missing this "feature".

    WWW::Mechanize:

    This could eventually be solved by overriding _make_request() in WWW::Mechanize with your own package and resetting the HTTP::Response content to its decoded_content, or, even dirtier, by overwriting $mech->{base} with the base parsed from the content.
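
A minimal sketch of the "parse the base from the decoded content" idea, assuming you only need the correct base URI and are happy to compute it yourself. The helper base_from_decoded() is hypothetical (it is not part of LWP or WWW::Mechanize), and the response is built by hand with made-up example.com URLs so the snippet runs offline. With WWW::Mechanize you would presumably pass it $mech->response instead.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use HTML::HeadParser;
use IO::Compress::Gzip qw(gzip $GzipError);
use URI;

# Hypothetical helper: work out the base URI from the decoded body
# instead of relying on a head that was parsed from gzipped bytes.
sub base_from_decoded {
    my ($res) = @_;

    # decoded_content() undoes the Content-Encoding (gzip here) and charset.
    my $html = $res->decoded_content;
    return $res->base unless defined $html;

    my $parser = HTML::HeadParser->new;
    $parser->parse($html);

    my $href = $parser->header('Content-Base');
    return $res->base unless defined $href && length $href;

    # Resolve a possibly relative <base href> against the response's own base.
    return URI->new_abs($href, $res->base);
}

# Offline stand-in for a gzipped server response (all values are made up).
my $page = '<html><head><base href="http://example.com/">'
         . '<title>demo</title></head><body><p>demo</p></body></html>';
gzip(\$page => \my $gzipped) or die "gzip failed: $GzipError";

my $res = HTTP::Response->new(200, 'OK', [
    'Content-Type'     => 'text/html; charset=UTF-8',
    'Content-Encoding' => 'gzip',
], $gzipped);
$res->request(HTTP::Request->new(GET => 'http://example.com/perl/page.html'));

print "response base: ", $res->base, "\n";
print "decoded base:  ", base_from_decoded($res), "\n";
```

This leaves the Mechanize object itself untouched, so links resolved by $mech would still use the wrong base. It is only a way to get at the right value without patching LWP.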