
Handling GET errors in WWW::Mechanize


I'm using a script that scrapes data from a website with WWW::Mechanize, and it all works well except for the website itself. Sometimes the site stops responding for a short moment, and for a given my $url = 'http://www.somesite.com/more/url/text' I get this error on $mech->get($url):

Error GETing http://www.somesite.com/more/url/text: Can't connect to www.somesite.com:443 at ./trackSomesite.pl line 34.

This error occurs once in a while with no recognizable pattern. From my experience with this website, it is caused by server instability.

I want to detect this specific error and tell it apart from other errors such as Too Many Requests. How can I get my script to handle this error instead of dying?


Solution

  • Either wrap your $mech->get(...) requests in an eval block or create the object with autocheck => 0, then check the status code ($mech->status) and/or the response's status line ($mech->res->status_line) to decide what to do.

    Here is an example:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use WWW::Mechanize;
    
    use constant RETRY_MAX => 5;
    
    my $url = 'http://www.xxsomesite.com/more/url/text'; # Cannot connect
    
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    
    my $content = fetch($url);
    
    sub fetch {
        my ($url) = @_;
    
        for my $retry (0 .. RETRY_MAX-1) {
            my $message = "Attempting to fetch [ $url ]";
            $message .= $retry ? " - retry $retry\n" : "\n";
            warn $message;
    
            my $response = $mech->get($url);
            return $response->content() if $response->is_success();
    
            my $status = $response->code;   # HTTP::Response provides code(), not status()
            warn "status = $status\n";
    
            if ($response->status_line =~ /Can't connect/) {
                my $delay = $retry + 1;   # linear backoff: 1, 2, 3, ... seconds
                warn "cannot connect...will retry after $delay seconds\n";
                sleep $delay;
            } elsif ($status == 429) {
                warn "too many requests...ignoring\n";
                return undef;
            } else {
                warn "something else...\n";
                return undef;
            }
        }
    
        warn "giving up...\n";
        return undef;
    }
    

    Output

    Attempting to fetch [ http://www.xxsomesite.com/more/url/text ]
    status = 500
    cannot connect...will retry after 1 seconds
    Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 1
    status = 500
    cannot connect...will retry after 2 seconds
    Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 2
    status = 500
    cannot connect...will retry after 3 seconds
    Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 3
    status = 500
    cannot connect...will retry after 4 seconds
    Attempting to fetch [ http://www.xxsomesite.com/more/url/text ] - retry 4
    status = 500
    cannot connect...will retry after 5 seconds
    giving up...
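
The example above takes the autocheck => 0 route. For the other option, wrapping the request in an eval block, a minimal sketch might look like the following. The helper names classify_error and fetch_with_eval are hypothetical, not part of WWW::Mechanize. The message patterns are assumptions based on the error text quoted in the question and on the standard reason phrase for a 429 response, so a server that sends a different phrase would need a different pattern.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize): map the text of a
# die message to an action. The patterns are assumptions based on the
# error quoted in the question and the usual 429 reason phrase.
sub classify_error {
    my ($error) = @_;
    return 'retry'        if $error =~ /Can't connect/;
    return 'rate_limited' if $error =~ /Too Many Requests/i;
    return 'other';
}

# Hypothetical helper: call $request (a code ref that dies on failure)
# up to $max times, sleeping $pause * attempt seconds after each
# connection failure. Returns the result, or undef if it gives up.
sub fetch_with_eval {
    my ($request, $max, $pause) = @_;
    $pause = 1 unless defined $pause;

    for my $attempt (1 .. $max) {
        my $result;
        # The trailing 1 makes eval return true only if nothing died.
        my $ok = eval { $result = $request->(); 1 };
        return $result if $ok;

        my $error = $@ || 'unknown error';
        my $kind  = classify_error($error);
        if ($kind ne 'retry') {
            warn "not retrying ($kind): $error";
            return undef;
        }
        warn "cannot connect (attempt $attempt of $max)\n";
        sleep $pause * $attempt if $attempt < $max;
    }

    warn "giving up...\n";
    return undef;
}

# Only touch the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new( autocheck => 1 );   # get() dies on error
    my $content = fetch_with_eval( sub { $mech->get($url)->content }, 5 );
    print defined $content ? "fetched " . length($content) . " bytes\n"
                           : "no content\n";
}
```

Passing the request as a code ref keeps the retry logic separate from the Mechanize object, so the same loop can wrap submit_form or any other call that dies. Matching on the message text is brittle. If you need the numeric code, $mech->status should still be readable after the eval, because the response is recorded before autocheck raises the error.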