Tags: perl, web-crawler, www-mechanize

Perl mechanize Find all links array loop issue


I am currently attempting to create a Perl web spider using WWW::Mechanize.

What I am trying to do is create a web spider that will crawl the entire site at a URL (entered by the user) and extract all of the links from every page on the site.

However, I have a problem with how to spider the whole site to get every link without duplicates. Here is what I have done so far (the part I'm having trouble with, anyway):

foreach (@nonduplicates) {   # array containing URLs like www.tree.com/contact-us, www.tree.com/varieties...
    $mech->get($_);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);  # find all links on this page that start with http://www.tree.com

    # THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT I CAN'T GET IT WORKING:
    # foreach (@list) {
    #     if $_ is already in @nonduplicates {
    #         do nothing, because that link has already been found
    #     } else {
    #         append the link to the end of @nonduplicates so that, if it has not been crawled already, it will be
    #     }
    # }
}

How would I be able to do the above?

I am doing this to try and spider the whole site to get a comprehensive list of every URL on the site, without duplicates.

If you think this is not the best/easiest way to achieve this, I'm open to ideas.

Your help is much appreciated, thanks.


Solution

  • Create a hash to track which links you've seen before and put any unseen ones onto @nonduplicates for processing:

    # $mech and $urlToSpider are assumed to be set up as in the question.
    $| = 1;          # Autoflush STDOUT so the progress line below updates in place.
    my $scanned = 0; # Number of pages fetched so far.
    
    my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
    my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.
    
    while (my $queued_link = pop @nonduplicates) {   # Process links until the queue is empty.
        $mech->get($queued_link);
        my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);
    
        for my $new_link (@list) {
            # Add the link to the queue unless we already encountered it.
            # Increment so we don't add it again.
            push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
        }
        printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
    }
    use Data::Dumper;
    print Dumper(\%link_tracker);
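
  • For reference, here is a minimal setup sketch for the $mech and $urlToSpider variables the snippet assumes are already defined. The autocheck => 0 option is my own suggestion rather than part of the original code; by default get() croaks on HTTP errors, which would stop the crawl at the first broken link:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # Hypothetical setup: take the start URL from the command line,
    # e.g.  perl spider.pl http://www.tree.com
    my $urlToSpider = shift @ARGV or die "Usage: $0 <start-url>\n";

    # autocheck => 0 stops get() from croaking on 404s or timeouts,
    # so a single broken link won't kill the whole crawl.
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    Once the loop finishes, the keys of %link_tracker hold every unique URL found, so you can swap the Data::Dumper dump for something like print "$_\n" for sort keys %link_tracker; if you just want one URL per line.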