Tags: perl, www-mechanize

Why does WWW::Mechanize GET certain pages but not others?


I'm new to Perl/HTML. I'm trying to use $mech->get($url) to get something from the periodic table page at http://en.wikipedia.org/wiki/Periodic_table, but it keeps returning an error message like this:

Error GETing http://en.wikipedia.org/wiki/Periodic_table: Forbidden at PeriodicTable.pl line 13

But $mech->get($url) works fine if $url is http://search.cpan.org/.

Any help will be much appreciated!


Here is my code:

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;

# autocheck => 1 makes get() die with a useful message on HTTP errors
my $mech = WWW::Mechanize->new( autocheck => 1 );

my $table_url = "http://en.wikipedia.org/wiki/Periodic_table/";

$mech->get( $table_url );

Solution

  • It's because Wikipedia denies access to some programs, based on the User-Agent header supplied with the request.

    You can make your script appear to be a 'normal' web browser by setting the agent string after instantiation and before calling get(), for example:

    $mech->agent( 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8' );
    

    That worked for me with the URL in your posting. Shorter strings will probably work too.

    (You should also remove the trailing slash from the URL, I think.)
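
    Putting it together, a minimal corrected version of your script might look like this (it reuses the agent string above; the final print is just a sanity check that the page was fetched):

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );

    # Pretend to be a browser; Wikipedia blocks the default libwww-perl agent.
    $mech->agent( 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8' );

    # Note: no trailing slash on the article URL.
    $mech->get( 'http://en.wikipedia.org/wiki/Periodic_table' );

    print $mech->title, "\n";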

    WWW::Mechanize is a subclass of LWP::UserAgent - see its documentation for more information, including on the agent() method.
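
    WWW::Mechanize also provides an agent_alias() convenience method that expands a short alias into a full browser user-agent string, so you don't have to paste one by hand. A small sketch ('Mac Safari' is one of the aliases the module documents):

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->agent_alias( 'Mac Safari' );   # expands to a full Safari agent string
    $mech->get( 'http://en.wikipedia.org/wiki/Periodic_table' );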

    You should limit your use of this method of access, though. Wikipedia explicitly denies access to some spiders in its robots.txt file, and the default user agent for LWP::UserAgent (which starts with libwww) is on that list.
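
    If you would rather respect robots.txt than bypass it, one option (a sketch, not part of the original answer; 'MyBot/1.0' is a placeholder agent name) is to consult the rules with WWW::RobotRules before fetching:

    use WWW::RobotRules;
    use LWP::Simple qw(get);

    my $robots_url = 'http://en.wikipedia.org/robots.txt';
    my $rules      = WWW::RobotRules->new('MyBot/1.0');   # placeholder agent name

    # Fetch and parse Wikipedia's robots.txt, then check a URL against it.
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    if ( $rules->allowed('http://en.wikipedia.org/wiki/Periodic_table') ) {
        # permitted for this agent name; safe to go ahead and GET
    }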