
How can I extract URL and link text from HTML in Perl?


I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>

The output would be:

Google, http://www.google.com
Apple, http://www.apple.com

What is the best way to do this in Perl?


Solution

  • Take a look at the WWW::Mechanize module for this. It fetches your web pages for you and then gives you easy-to-work-with lists of links.

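    use strict;
    use warnings;
    use WWW::Mechanize;

    my $some_url = shift @ARGV or die "Usage: $0 URL\n";
    my $mech = WWW::Mechanize->new();
    $mech->get( $some_url );
    # links() returns a list of WWW::Mechanize::Link objects
    for my $link ( $mech->links() ) {
        next unless defined $link->text;    # skip links that carry no text
        printf "%s, %s\n", $link->text, $link->url;
    }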
    

    Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

    Mech is basically a browser in an object.
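
If you already have the HTML in a string and don't need Mech to fetch anything, the same text/URL pairs can be pulled out offline. The following is only a minimal sketch that swaps in HTML::TokeParser (from the HTML-Parser distribution) in place of WWW::Mechanize; the helper name `extract_links` is made up for illustration, and the sample HTML is the two links from the question.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of any module): returns a list of
# [ link text, href ] pairs for every <a href="..."> in an HTML string.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Could not create parser\n";
    my @pairs;
    while ( my $tag = $parser->get_tag('a') ) {
        my $href = $tag->[1]{href};    # attributes of the start tag
        next unless defined $href;     # skip anchors without an href
        my $text = $parser->get_trimmed_text('/a');    # text up to </a>
        push @pairs, [ $text, $href ];
    }
    return @pairs;
}

# Sample input: the two links from the question.
my $html = <<'HTML';
<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>
HTML

printf "%s, %s\n", @$_ for extract_links($html);
```

With the sample above this prints `Google, http://www.google.com` and `Apple, http://www.apple.com`, one per line. Unlike the Mech version, it does no fetching and leaves relative URLs exactly as they appear in the markup.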