Tags: perl, www-mechanize

Cancel Download using WWW::Mechanize in Perl


I have written a Perl script that checks a list of URLs by sending a GET request to each of them.

Now, let's say that one of these URLs points to a very large file, for instance one with a size > 100 MB.

When a request is sent to download this file using this:

my $mech = WWW::Mechanize->new();
my $url  = "http://somewebsitename.com/very_big_file.txt";
$mech->get($url);

Once the GET request is sent, the file starts downloading. How can I cancel that download using WWW::Mechanize?

I checked the documentation of this Perl Module here:

http://metacpan.org/pod/WWW::Mechanize

However, I could not find a method which would help me do this.

Thanks.


Solution

  • Aborting a GET request

    Using the :content_cb option, you can provide a callback function to get() that will be executed for each chunk of response content received from the server. You can also set* the chunk size (in bytes) using the :read_size_hint option. Both options are documented in LWP::UserAgent (WWW::Mechanize is a subclass of LWP::UserAgent, and its get() is an overridden version of the same method).

    The following request will be aborted after reading 1024 bytes of response content:

    use strict;
    use warnings;
    use WWW::Mechanize;
    
    sub callback {
        my ($data, $response, $protocol) = @_;
    
        # die() here aborts the transfer; see the note on X-Died below
        die "Too much data";
    }
    
    my $mech = WWW::Mechanize->new;
    
    my $url = 'http://www.example.com';
    
    $mech->get($url, ':content_cb' => \&callback, ':read_size_hint' => 1024);
    
    print $mech->response()->header('X-Died');
    

    Output:

    Too much data at ./mechanize line 12.
    

    Note that the die in the callback does not cause the program itself to die; it simply sets the X-Died header in the response object. You can add the appropriate logic to your callback to determine under what conditions a request should be aborted.
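    As a sketch of such logic, a callback that aborts once a cumulative byte limit is exceeded might look like the following (the 1 MB limit, the $bytes_read counter, and the limited_callback name are all illustrative choices, not part of the WWW::Mechanize API):

```perl
use strict;
use warnings;

my $limit      = 1_048_576;   # illustrative 1 MB cap
my $bytes_read = 0;           # running total across chunks

sub limited_callback {
    my ($data, $response, $protocol) = @_;

    # Accumulate the size of each chunk; die() aborts the transfer,
    # and the message ends up in the X-Died response header.
    $bytes_read += length $data;
    die "Exceeded $limit bytes" if $bytes_read > $limit;
}

# Passed to get() exactly like the callback above:
# $mech->get($url, ':content_cb' => \&limited_callback, ':read_size_hint' => 1024);
```

    If you reuse the callback for multiple URLs, remember to reset $bytes_read between requests.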

  • Don't even fetch the URL if the content is too large

    Based on your comments, it sounds like what you really want is to never send the request in the first place if the content is too large. This is quite different from aborting a GET request midway through: you can fetch the Content-Length header with a HEAD request and decide whether to GET at all based on its value:

    use strict;
    use warnings;
    use WWW::Mechanize;
    
    my $mech = WWW::Mechanize->new;
    
    my @urls = qw(http://www.example.com http://www.google.com);
    
    foreach my $url (@urls) {
        $mech->head($url);
    
        if ($mech->success) {
            my $length = $mech->response()->header('Content-Length') // 0;
    
            next if $length > 1024;
    
            $mech->get($url);
        }
    }
    

    Note that according to the HTTP spec, applications should set the Content-Length header. This does not mean that they will (hence the default value of 0 in my code example).


    * According to the documentation, :read_size_hint tells the "protocol module" to "try to read data from the server in chunks of this size," but I don't think the chunk size is guaranteed.