I have Mojo::DOM.
my $doc = Mojo::DOM->new(decode_utf8($html_page_content));
I want either of 2 things:
1) find all "a" tags whose href starts with "/my_link", "/my_link2" or "/my_link3"
or
2) find all "a" tags, iterate over them and check whether a link starts with "/my_link", "/my_link2" or "/my_link3"
Whichever is more efficient, if there's a big difference between them at all.
How can I do that?
I know how to find all the links:
$doc->find('a')->each(sub {
  # $_ is the <a> element, so build the URL from its href attribute
  my $link = Mojo::URL->new($_->attr('href') // '');
  # ....
});
You can use CSS selectors to narrow down your search to specific URLs. In particular, you'll want to search for links with the attribute href (a[href]) where the value of href starts with a certain string (a[href^="..."]). To search for several different URLs, just use a comma-separated list of selectors in $dom->find('...').
Here is an example that extracts links beginning with three different strings (I used URLs from this web page). You can adapt it to your own case:
my $dom = Mojo::DOM->new($page);
for my $url ( $dom->find('a[href^="https://stackoverflow.com"], a[href^="https://stackexchange.com"], a[href^="https://area51"]')->each ) {
  say $url->attr('href'); # or do whatever you want to here
}
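Adapted to the prefixes from the question, the same idea might look like the sketch below. The sample HTML and its hrefs are made up for illustration. Note that "/my_link2" and "/my_link3" both begin with "/my_link", so the single selector a[href^="/my_link"] would match the same set of links; the comma-separated form is kept here only because it generalises to unrelated prefixes.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical sample page; these hrefs are assumptions for illustration only.
my $html = <<'HTML';
<a href="/my_link/a">one</a>
<a href="/my_link2/b">two</a>
<a href="/my_link3">three</a>
<a href="/other">four</a>
<a name="no-href">five</a>
HTML

my $dom = Mojo::DOM->new($html);

# One selector per prefix, joined with commas; each matching element
# is returned once even if more than one selector matches it.
my $selector = join ', ',
  map { qq{a[href^="$_"]} } qw(/my_link /my_link2 /my_link3);

# map(attr => 'href') calls ->attr('href') on every matched element.
my @hrefs = $dom->find($selector)->map(attr => 'href')->each;

say for @hrefs;
```

Links without an href attribute, and links whose href starts with anything else, are skipped by the selector itself, so no further check is needed in the loop.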
If you want to use your suggested method (2), fetch all links and filter them yourself, you could do so like this:
for my $url ( $dom->find('a[href^="https://"]')->each ) {
  # substitute in your own regex here
  if ( $url->attr('href') =~ /(stackoverflow|area51|codereview)/ ) {
    say $url->attr('href'); # or whatever
  }
}
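For the prefixes in the question, filtering by hand can be worthwhile when the match needs to be stricter than a plain "starts with". The following is a minimal sketch under that assumption: the sample HTML is invented, and the rule that the prefix must be followed by "/", "?", "#" or the end of the string is my own choice, not something stated in the question.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical sample page; these hrefs are assumptions for illustration only.
my $html = <<'HTML';
<a href="/my_link">one</a>
<a href="/my_link2/x">two</a>
<a href="/my_link3?q=1">three</a>
<a href="/my_link_other">four</a>
<a href="/elsewhere/my_link">five</a>
HTML

my $dom = Mojo::DOM->new($html);

# Anchored at the start of the href. The trailing group is an assumed
# rule: the prefix must end the path segment, which rejects "/my_link_other".
my $wanted = qr{^/my_link[23]?(?:[/?#]|$)};

my @matches = $dom->find('a[href]')                # only links that have an href
  ->grep(sub { $_->attr('href') =~ $wanted })      # keep those matching the regex
  ->map(attr => 'href')                            # extract the href strings
  ->each;

say for @matches;
```

A CSS prefix selector such as a[href^="/my_link"] would also accept "/my_link_other", which is the kind of case where the regex filter earns its keep.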
It's unlikely that there will be much difference in efficiency between the two methods, and it's likely that you'd spend more time benchmarking them than you would gain by using whichever of the two is quicker.