Tags: perl, web-crawler, www-mechanize

Perl mechanize Find all links array loop issue


I am currently attempting to create a Perl web spider using WWW::Mechanize.

What I am trying to do is create a web spider that will crawl the entire site at a URL (entered by the user) and extract all of the links from every page on the site.

However, I have a problem with how to spider the whole site to get every link without duplicates. Here is what I have done so far (the part I'm having trouble with, anyway):

foreach (@nonduplicates) {   # array containing URLs like www.tree.com/contact-us, www.tree.com/varieties...
    $mech->get($_);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);  # find all links on this page that start with http://www.tree.com

    # THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT I CAN'T GET IT WORKING:
    # foreach (@list) {
    #     if $_ is already in @nonduplicates {
    #         do nothing, because that link has already been found
    #     } else {
    #         append the link to the end of @nonduplicates so that, if it has not been crawled already, it will be
    #     }
    # }
}

How would I be able to do the above?

I am doing this to try and spider the whole site to get a comprehensive list of every URL on the site, without duplicates.

If you think this is not the best/easiest way to achieve this, I'm open to ideas.

Your help is much appreciated, thanks.


Solution

  • Create a hash to track which links you've seen before and put any unseen ones onto @nonduplicates for processing:

    # $mech and $urlToSpider are assumed to be set up as in the question.
    $| = 1;          # Autoflush STDOUT so the progress line below updates in place.
    my $scanned = 0; # Number of pages fetched so far.
    
    my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
    my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.
    
    while (my $queued_link = pop @nonduplicates) {   # Process links until the queue is empty.
        $mech->get($queued_link);
        my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);
    
        for my $new_link (@list) {
            # Add the link to the queue unless we already encountered it.
            # Increment so we don't add it again.
            push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
        }
        printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
    }
    use Data::Dumper;
    print Dumper(\%link_tracker);
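
  • For reference, here is a minimal setup sketch for the $mech and $urlToSpider variables the snippet assumes are already defined. The autocheck => 0 option is my own suggestion rather than part of the original code; by default get() croaks on HTTP errors, which would stop the crawl at the first broken link:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    # Hypothetical setup: take the start URL from the command line,
    # e.g.  perl spider.pl http://www.tree.com
    my $urlToSpider = shift @ARGV or die "Usage: $0 <start-url>\n";

    # autocheck => 0 stops get() from croaking on 404s or timeouts,
    # so a single broken link won't kill the whole crawl.
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    Once the loop finishes, the keys of %link_tracker hold every unique URL found, so you can swap the Data::Dumper dump for something like print "$_\n" for sort keys %link_tracker; if you just want one URL per line.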