Running perl hack from "Yahoo! Directory Mindshare in Google"

Has anyone run the Perl script from that hack?

It is a well-known script for pulling URLs from the Yahoo! directory, and many people have used it successfully.

I was trying to get URLs with it. I created my own Google API key and put it in the code; apart from that I made no changes.

The script produces neither an error nor any URLs.

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use HTML::LinkExtor;
use SOAP::Lite;

my $google_key  = "your API key goes here";
my $google_wdsl = "GoogleSearch.wsdl";
my $yahoo_dir   = shift || "/Computers_and_Internet/Data_Formats/XML_  _".

# download the Yahoo! directory.
my $data = get("" . $yahoo_dir) or die $!;

# create our Google object.
my $google_search = SOAP::Lite->service("file:$google_wdsl");
my %urls; # where we keep our counts and titles.

# extract all the links and parse 'em.
HTML::LinkExtor->new(\&mindshare)->parse($data);

sub mindshare { # for each link we find...
  my ($tag, %attr) = @_;
  print "$tag\n";

  # continue on only if the tag was a link,
  # and the URL matches Yahoo!'s redirectory.
  return if $tag ne 'a';
  return unless $attr{href} =~ /;
  return unless $attr{href} =~ /\*http/;

  # now get our real URL.
  $attr{href} =~ /\*(http.*)/; my $url = $1;
  print "hi";

  # and process each URL through Google.
  my $results = $google_search->doGoogleSearch(
                      $google_key,"link:$url", 0, 1,
                      "true", "", "false", "", "", ""
                ); # wheee, that was easy, guvner.
  $urls{$url} = $results->{estimatedTotalResultsCount};
  print "1\n";
}

# now sort and display.
my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;
foreach my $url (@sorted_urls) { print "$urls{$url}: $url\n"; }

The program enters the loop and leaves it on the first iteration, going straight to "my @sorted_urls = sort { $urls{$b} <=> $urls{$a} } keys %urls;".

I don't know Perl, but this task should have been trivial.

Surely I am missing something very obvious, because many people have used this script successfully.

Thanks in advance.


  • Are you supplying a directory to the script? If you are not, and this line in your script

    "/Computers_and_Internet/Data_Formats/XML_  _".

    is not a formatting artefact, then you're trying to scrape a non-existent page.
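
One way to check which case you are in is to run the callback's filter on a few hand-written links and count how many survive. The sketch below is only an illustration: the sub name extract_target and the sample hrefs are made up, and it applies only the \*http test from your script, not the other pattern. If the page you fetched contains no redirect-style links, every call returns early, which would match the symptom you describe: no URLs and no error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: mirrors the early returns in the mindshare callback.
# Returns the target URL for a redirect-style link, or undef otherwise.
sub extract_target {
    my ($tag, %attr) = @_;
    return if $tag ne 'a';                      # only anchors are of interest
    return unless defined $attr{href};          # skip anchors without an href
    return unless $attr{href} =~ /\*(http.*)/;  # href must embed "*http..."
    return $1;
}

# Made-up sample links, not real Yahoo! URLs.
my @links = (
    [ 'a',   href => 'http://redirect.example/path/*http://www.example.com/' ],
    [ 'a',   href => '/some/relative/page.html' ],
    [ 'img', src  => 'logo.gif' ],
);

# Force scalar context so a bare "return" yields undef, then drop the misses.
my @targets = grep { defined } map { scalar extract_target(@$_) } @links;

printf "%d of %d links matched\n", scalar @targets, scalar @links;
print "$_\n" for @targets;
```

Putting a similar counter inside mindshare in the real script would show whether the fetched page had any matching links at all, or whether the filter rejected every one of them.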