Perl WWW::Mechanize/HTML::TokeParser: following and storing URLs from href attributes


I am making good progress with Perl thanks to the help on this site, but I've run into a problem. One of the pages I was scraping has changed, and I can't figure out how to get to it now. What I want to do is store a link to each page I need to visit. The problem is that these links sit in the href attributes of the a tags in the page source, and I have no idea how to extract them. Could anyone help me?

The links I need are on lines 316 to 354 of the source code of this page: http://www.soccerbase.com/teams/home.sd

Basically, I need to extract the links into variables for use in my other scripts. As mentioned, I am using WWW::Mechanize and HTML::TokeParser; hopefully they have methods I can use for this, but I can't currently figure out which ones. Thanks in advance!


Solution

  • See the find_all_links method in WWW::Mechanize. There is no need to drive the parser manually. The regex below matches only the Premier League links (comp_id=1); you probably want to relax it so that you get all ~1000 possible teams at once.

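    use strict;
    use warnings;
    use WWW::Mechanize qw();

    my $w = WWW::Mechanize->new;
    $w->get('http://www.soccerbase.com/teams/home.sd');
    # find_all_links returns WWW::Mechanize::Link objects whose URL matches the regex
    for my $link ($w->find_all_links(url_regex => qr/comp_id=1\b/)) {
        # url_abs resolves the href against the page URL; text is the link text
        printf "URL=%s\tTeam=%s\n", $link->url_abs, $link->text;
    }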
    

    URL=http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1       Team=Premier League
    URL=http://www.soccerbase.com/teams/team.sd?team_id=142&comp_id=1       Team=Arsenal
    URL=http://www.soccerbase.com/teams/team.sd?team_id=154&comp_id=1       Team=Aston Villa
    URL=http://www.soccerbase.com/teams/team.sd?team_id=308&comp_id=1       Team=Blackburn
    URL=http://www.soccerbase.com/teams/team.sd?team_id=354&comp_id=1       Team=Bolton
    URL=http://www.soccerbase.com/teams/team.sd?team_id=536&comp_id=1       Team=Chelsea
    URL=http://www.soccerbase.com/teams/team.sd?team_id=942&comp_id=1       Team=Everton
    URL=http://www.soccerbase.com/teams/team.sd?team_id=1055&comp_id=1      Team=Fulham
    URL=http://www.soccerbase.com/teams/team.sd?team_id=1563&comp_id=1      Team=Liverpool
    URL=http://www.soccerbase.com/teams/team.sd?team_id=1718&comp_id=1      Team=Man City
    URL=http://www.soccerbase.com/teams/team.sd?team_id=1724&comp_id=1      Team=Man Utd
    URL=http://www.soccerbase.com/teams/team.sd?team_id=1823&comp_id=1      Team=Newcastle
    URL=http://www.soccerbase.com/teams/team.sd?team_id=1855&comp_id=1      Team=Norwich
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2093&comp_id=1      Team=QPR
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2477&comp_id=1      Team=Stoke
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2493&comp_id=1      Team=Sunderland
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2513&comp_id=1      Team=Swansea
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2590&comp_id=1      Team=Tottenham
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2744&comp_id=1      Team=West Brom
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2783&comp_id=1      Team=Wigan
    URL=http://www.soccerbase.com/teams/team.sd?team_id=2848&comp_id=1      Team=Wolves
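
To keep the links in variables for other scripts, as the question asks, one option is to collect them into a hash keyed by team name. The following is only a sketch under some assumptions: the helper name team_urls is made up for illustration, the relaxed pattern qr/team_id=\d+/ is a guess based on the URLs in the output above, and the page URL is taken from the command line so that nothing is fetched unless you pass one.

```perl
use strict;
use warnings;
use WWW::Mechanize ();
use WWW::Mechanize::Link ();

# Hypothetical helper: turn a list of WWW::Mechanize::Link objects
# into a hash of team name => absolute URL (as a plain string).
sub team_urls {
    my @links = @_;
    my %url_for;
    for my $link (@links) {
        my $team = $link->text;
        # Skip links without text (for example image-only links).
        next unless defined $team && length $team;
        # If the same name appears more than once, the last URL wins.
        $url_for{$team} = $link->url_abs->as_string;
    }
    return %url_for;
}

# Only touch the network when a page URL is given on the command line.
if (my $page = shift @ARGV) {
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die on HTTP errors
    $mech->get($page);

    # Assumption: every team link contains "team_id=<digits>".
    my %url_for = team_urls(
        $mech->find_all_links(url_regex => qr/team_id=\d+/)
    );

    printf "%s => %s\n", $_, $url_for{$_} for sort keys %url_for;
}
```

Run it as, say, perl teams.pl http://www.soccerbase.com/teams/home.sd (the script name is arbitrary). Afterwards $url_for{'Arsenal'} holds the URL for that team, and you can pass it straight to another $mech->get call. A hash is used here because lookups by team name are convenient; an array of Link objects would work just as well if you only need to loop over them.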