perl, web-scraping, www-mechanize

WWW::Mechanize and iteration


I am trying to scrape information from http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1 (lines 1184 to 1325), basically the upcoming games for the next 7 days. I have the code working for a single fixture, but I can't figure out how to iterate it so that it scrapes every game until it reaches the end of the 7 days' worth of games. Is there some sort of loop I can create that will keep scraping until I hit a certain tag or something? Here is my code so far. Thanks in advance!

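use strict;
use warnings;
use WWW::Mechanize;
use HTML::TokeParser;

my $page = WWW::Mechanize->new;

$page->get('http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1');

# Parse the fetched HTML via the content() accessor instead of the object's internals.
my $html   = $page->content;
my $stream = HTML::TokeParser->new(\$html);
my @fixture;

# Skip forward to the first <td class="dateTime"> cell;
# get_tag returns undef at the end of the document, so stop there too.
my $tag = $stream->get_tag("td");
while ($tag && ($tag->[1]{class} // '') ne "dateTime") {
    $tag = $stream->get_tag("td");
}

# Record the text up to the next closing </a> (the date).
if ($tag && $tag->[1]{class} eq "dateTime") {
    push(@fixture, $stream->get_trimmed_text("/a"));
}

# Skip ahead two links and record the text of the second.
$stream->get_tag("a");
$stream->get_tag("a");
push(@fixture, $stream->get_trimmed_text("/a"));

# Record the text of the next link.
$stream->get_tag("a");
push(@fixture, $stream->get_trimmed_text("/a"));

foreach my $element (@fixture) {
    print $element, "\t";
}
print "\n";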

Solution

  • Try Web::Query for parsing HTML; it is much nicer to use than TokeParser. It works declaratively rather than imperatively: you select elements with CSS expressions.

    If a row's score cell is 'v', add the row to the result set; otherwise discard the row.

    use Web::Query 'wq';
    # $mech is the WWW::Mechanize object (named $page in the question's code)
    my $football_matches = wq($mech->content)
        ->find('tr.match')
        ->map(sub {
            my (undef, $e) = @_;
            return 'v' eq $e->find('td.score')->text
                ? [
                    $e->attr('id'),
                    map { $e->find("td.$_")->text }
                      (qw(tournament dateTime homeTeam score awayTeam prices))
                ]
                : ();
        });
    use Data::Dumper; print Dumper $football_matches;
    

    $VAR1 = [
        ['tn7gc635476', '', ' Mo 12Mar 2012 ', 'Arsenal',   'v', 'Newcastle', '  '],
        ['tn7gc649937', '', ' Tu 13Mar 2012 ', 'Liverpool', 'v', 'Everton',   '  '],
        ['tn7gc635681', '', ' Sa 17Mar 2012 ', 'Fulham',    'v', 'Swansea',   '  '],
        ['tn7gc635661', '', ' Sa 17Mar 2012 ', 'Wigan',     'v', 'West Brom', '  '],
        ['tn7gc635749', '', ' Su 18Mar 2012 ', 'Wolves',    'v', 'Man Utd',   '  '],
        ['tn7gc635556', '', ' Su 18Mar 2012 ', 'Newcastle', 'v', 'Norwich',   '  ']
    ];
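
For anyone who would rather stay with HTML::TokeParser, the loop the question asks about can be sketched roughly as follows. This is only an illustration: it assumes each fixture is a table row whose cells carry the classes used in the selectors above (dateTime, homeTeam, score, awayTeam), and it parses a small made-up HTML sample rather than the live page. The sub name upcoming_fixtures and the sample markup (including the "Home X" / "Away Y" row) are hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup. The cell class names come from the selectors in
# the answer above; the rest of the structure is an assumption.
my $html = <<'HTML';
<table>
<tr class="match">
  <td class="dateTime"><a href="#">Mo 12Mar 2012</a></td>
  <td class="homeTeam"><a href="#">Arsenal</a></td>
  <td class="score"><a href="#">v</a></td>
  <td class="awayTeam"><a href="#">Newcastle</a></td>
</tr>
<tr class="match">
  <td class="dateTime"><a href="#">Sa 10Mar 2012</a></td>
  <td class="homeTeam"><a href="#">Home X</a></td>
  <td class="score"><a href="#">2 - 1</a></td>
  <td class="awayTeam"><a href="#">Away Y</a></td>
</tr>
</table>
HTML

# Collect [date, home, score, away] for every row whose score cell is 'v'.
sub upcoming_fixtures {
    my ($doc) = @_;
    my $stream = HTML::TokeParser->new(\$doc) or die "cannot parse HTML";
    my @wanted = qw(dateTime homeTeam score awayTeam);
    my (@fixtures, %row);

    # get_tag returns undef once the document is exhausted, which ends the loop.
    while (my $tag = $stream->get_tag('td', '/tr')) {
        my ($name, $attr) = @$tag;
        if ($name eq 'td') {
            my $class = $attr->{class} // '';
            next unless grep { $_ eq $class } @wanted;
            # Text of the whole cell, including text inside nested links.
            $row{$class} = $stream->get_trimmed_text('/td');
        }
        else {
            # Closing </tr>: keep the row only if it is an unplayed fixture.
            push @fixtures, [ @row{@wanted} ] if ($row{score} // '') eq 'v';
            %row = ();
        }
    }
    return \@fixtures;
}

print join("\t", @$_), "\n" for @{ upcoming_fixtures($html) };
```

To run it against the real page, pass the fetched content (for example $page->content from the question's code) to the sub instead of the sample string. The loop needs no explicit stop tag: it ends when get_tag runs out of input, and rows that do not look like upcoming fixtures are simply skipped.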