Problem: I have a list of 2500 websites and need to grab a thumbnail screenshot of each of them. How do I do that? I could try to parse the sites with Perl; WWW::Mechanize would be a good thing. Note: I only need the results as thumbnails that are a maximum of 240 pixels in the long dimension. At the moment I have a solution which is slow and does not give back thumbnails. How do I make the script run faster, with less overhead, while spitting out the thumbnails?
Prerequisites: the mozrepl add-on for Firefox; the module WWW::Mechanize::Firefox; the module Imager.
First Approach: Here is a first Perl solution:
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get('http://google.com');
my $png = $mech->content_as_png();
Outline: This returns the given tab or the current page rendered as a PNG image. All parameters are optional. $tab defaults to the current tab. If the coordinates are given, that rectangle will be cut out. The coordinates should be a hash with the four usual entries: left, top, width, height. This is specific to WWW::Mechanize::Firefox.
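For example, a cut-out of just the top-left corner could look like this (a sketch based on the perldoc; I am assuming that passing undef for $tab falls back to the current tab):

my $png = $mech->content_as_png( undef, { left => 0, top => 0, width => 240, height => 240 } );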
As I understand from the perldoc, the coordinates option is not a resize of the whole page; it is just a rectangle cut out of it. So WWW::Mechanize::Firefox takes care of how to save the screenshots. I forgot to mention that I only need the images as small thumbnails, so we do not have to have very large files; I only need to grab a thumbnail screenshot of each site. I did a lookup on CPAN for a module that scales down the $png, and I found Imager.
The Mechanize module does not concern itself with resizing images; for that there are the various image modules on CPAN, like Imager. From its documentation: "Imager - Perl extension for Generating 24 bit Images: Imager is a module for creating and altering images. It can read and write various image formats, draw primitive shapes like lines and polygons, blend multiple images together in various ways, scale, crop, render text and more." I installed the module, but I have not yet extended my basic approach with it.
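Here is a minimal sketch of how the scaling could look, assuming $png already holds the PNG data from content_as_png() and $name the output filename (as in my script below); the 240-pixel box and type => 'min' come from Imager's scale() documentation:

use Imager;

my $img = Imager->new();
$img->read( data => $png, type => 'png' ) or die $img->errstr;

# scale so that the longer dimension is at most 240 pixels
my $thumb = $img->scale( xpixels => 240, ypixels => 240, type => 'min' );

$thumb->write( file => $name, type => 'png' ) or die $thumb->errstr;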
What I have tried already; here it is:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

open( my $in, '<', 'urls.txt' ) or die "Cannot open urls.txt: $!";
while ( my $url = <$in> ) {
    chomp $url;
    print "$url\n";
    $mech->get($url);
    my $png  = $mech->content_as_png();
    my $name = $url;
    $name =~ s/^www\.//;
    $name .= '.png';
    open( my $out, '>', $name ) or die "Cannot open $name: $!";
    binmode $out;    # the PNG is binary data
    print {$out} $png;
    close $out;
    sleep 5;
}
close $in;
This does not take care of the size, though. See the command-line output:
linux-vi17:/home/martin/perl # perl mecha_test_1.pl
www.google.com
www.cnn.com
www.msnbc.com
command timed-out at /usr/lib/perl5/site_perl/5.12.3/MozRepl/Client.pm line 186
linux-vi17:/home/martin/perl #
This is my source ... see a snippet [example] of the sites I have in the URL list.
urls.txt (the list of sources):
www.google.com
www.cnn.com
www.msnbc.com
news.bbc.co.uk
www.bing.com
www.yahoo.com
Question: How do I extend the solution so that it does not stop with a timeout, and so that it only stores little thumbnails? Note again: I only need the results as thumbnails that are a maximum of 240 pixels in the long dimension. As a prerequisite, I already have the module Imager installed. How do I make the script run faster, with less overhead, while spitting out the thumbnails?
Love to hear from you! Greetings, zero
Update: In addition to Schwern's idea, which is very interesting, I found an interesting PerlMonks thread which talks about the same timeouts:
Is there a way to specify the Net::Telnet timeout with WWW::Mechanize::Firefox? At the moment my internet connection is very slow and sometimes I get this error with
$mech->get(): command timed-out at /usr/local/share/perl/5.10.1/MozRepl/Client.pm line 186
Perhaps I have to look at the MozRepl timeout configuration!? But after all: this is weird and I don't know where that timeout comes from. Maybe it really is Firefox timing out because it is busy synchronously fetching some result. As you can see in the trace, WWW::Mechanize::Firefox polls every second (or so) to see whether Firefox has fetched a page.
If it really is Net::Telnet, then you'll have to dive down:
$mech->repl->repl->client->{telnet}->timeout($new_timeout);
Update: So the question was whether I can make use of Net::Telnet, which I assumed is in the Perl core. @Alexandr Ciornii: thanks for the hint! I would then do it like this: use Net::Telnet; but if it is not in the core, then I cannot go like this. @Daxim: $ corelist Net::Telnet gives "Net::Telnet was not in CORE", which means I cannot go like above.
By the way, as Øyvind Skaar mentioned: with that many URLs we have to expect that some will fail, and handle that. For example, we could put the failed ones in an array or hash and retry them X times, as in the sketch below.
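A minimal retry sketch, assuming the URLs have already been read into @urls, that $mech exists, and that a failed fetch throws an exception we can trap with eval (the hash name and the three retries are my own choice):

# first pass: remember every URL whose fetch died
my %failed;
for my $url (@urls) {
    eval { $mech->get($url); 1 } or $failed{$url} = 1;
}

# second pass: retry each failed URL up to 3 times
for my $url ( keys %failed ) {
    for my $try ( 1 .. 3 ) {
        last if eval { $mech->get($url); 1 };
        warn "retry $try failed for $url\n";
    }
}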
Look into Parallel::ForkManager, which is one of the easier and more reliable ways to do parallel processing in Perl. Most of your work will be network- and I/O-bound: your CPU will be waiting around for the remote web server to return, so you're likely to get some big wins.
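A rough sketch of how that could look, assuming the URLs are in @urls and that each forked child creates its own WWW::Mechanize::Firefox object so the workers do not share one mozrepl connection (my assumption; the worker count of 5 is arbitrary):

use Parallel::ForkManager;
use WWW::Mechanize::Firefox;

my $pm = Parallel::ForkManager->new(5);    # at most 5 children at once

for my $url (@urls) {
    $pm->start and next;                   # parent forks and moves on to the next URL
    my $mech = WWW::Mechanize::Firefox->new();
    $mech->get($url);
    # ... save the scaled thumbnail here, as in the Imager sketch above ...
    $pm->finish;                           # child exits
}
$pm->wait_all_children;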
As for the timeout, that's somewhere inside MozRepl and defaults to 10 seconds. You'd either have to create a MozRepl::Client object with a different timeout and somehow get WWW::Mechanize::Firefox to use it, or you can do some undocumented things. This PerlMonks thread shows how to change the timeout. There's also an undocumented MOZREPL_TIMEOUT environment variable which you can set.
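A hedged sketch of both routes; these are undocumented internals that may change between versions, and the 120 seconds is just an example value:

$ENV{MOZREPL_TIMEOUT} = 120;               # assumption: must be set before the object is created
my $mech = WWW::Mechanize::Firefox->new();

# or reach into the Net::Telnet connection directly, as shown above:
$mech->repl->repl->client->{telnet}->timeout(120);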