I am trying to extract the link for the next page in a search results page using Mojo::DOM. However, I have a problem where instead of Mojo::DOM elements, I get a string after using ->find()
on an existing element.
I have:
my $pagination_elements = $dom->find("div[class*=\"pagination-block\"]");
my $page_counter_text = $pagination_elements->find("div[class=\"page-of-pages\"]")->text();
$page_counter_text =~ /^Page (\d+) of (\d+)$/;
my $current_page = int($1);
my $last_page = int($2);
my $prev_next_elements = $pagination_elements->find("a[class*=\"prev-next\"]");
my $next_page_link = $prev_next_elements->last();
my $next_page_url = $next_page_link->attr("href");
On each page, there may be 2 link tags with a class of prev-next
. Instead of getting the link for the last element, what I get is a string that contains the href
for both of the tags (if both are available on the page).
Now, if instead of this I do:
my $next_page_link = $dom->find("div[class*=\"pagination-block\"] > ul > li > a[class*=\"prev-next\"]")->last();
my $next_page_url_rel = $next_page_link->attr("href");
I get the required link.
My question is, why does the second version work and not the first? Why do I have to start from the root DOM element to get a list of elements, and why starting from a child of the root returns a string containing all the link tags instead of just the one I want?
Edit An example of the HTML I am parsing is:
<div class="pagination-block clearfix">
<div class="page-of-pages">
Page 2 of 100
</div>
<ul class="pagination-links">
<li>
.
.
.
</li>
<li>
<a class="page-option prev-next" href="PREV LINK">Prev</a>
</li>
<li>
<a class="page-option prev-next" href="NEXT LINK">Next</a>
</li>
</ul>
</div>
If you used Data::Dump
(or some equivalent module) instead of print
, you would get a clue as to what's going on:
use Data::Dump;
dd $next_page_url;
dd $next_page_url_rel;
Outputs:
bless(["PREV LINK", "NEXT LINK"], "Mojo::Collection")
"NEXT LINK"
As you can see, your first variable actually holds a collection, and not a string.
The problem arises because the Mojo::DOM->find
returns a Mojo::Collection
:
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
Doing a subsequent find
on a collection returns you a nested collection which is not going to perform the way you expect with calls like last
.
Here are three different solutions to fix your first attempt to find the link text:
Use the Mojo::DOM->at
method to find
the first element in DOM structure matching the CSS selector.
my $pagination_elements = $dom->at('div[class*="pagination-block"]');
Use Mojo::Collection->first
or ->last
to isolate a specific element in the collection before the subsequent find
.
my $pagination_elements
= $dom->find('div[class*="pagination-block"]')->last();
Use Mojo::Collection->flatten
to flatten the nested collections created by your subsequent find
into a new collection with all elements:
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
my $prev_next_elements
= $pagination_elements->find('a[class*="prev-next"]')->flatten();
All of these methods will make your script work as you intended:
use strict;
use warnings;
use Mojo::DOM;
use Data::Dump;
my $dom = Mojo::DOM->new(do { local $/; <DATA> });
# Fix 1
my $pagination_elements = $dom->at('div[class*="pagination-block"]');
# Fix 2
#my $pagination_elements
# = $dom->find('div[class*="pagination-block"]')->last();
# Fix 3
#my $pagination_elements = $dom->find('div[class*="pagination-block"]');
#my $prev_next_elements
# = $pagination_elements->find('a[class*="prev-next"]')->flatten();
my $prev_next_elements = $pagination_elements->find('a[class*="prev-next"]');
my $next_page_link = $prev_next_elements->last();
my $next_page_url = $next_page_link->attr("href");
dd $next_page_url;
$next_page_link = $dom->find('div[class*="pagination-block"] > ul > li > a[class*="prev-next"]')->last();
my $next_page_url_rel = $next_page_link->attr("href");
dd $next_page_url_rel;
__DATA__
<html>
<head>
<title>Paging Example</title>
</head>
<body>
<div class="pagination-block clearfix">
<div class="page-of-pages">
Page 2 of 100
</div>
<ul class="pagination-links">
<li>
.
.
.
</li>
<li>
<a class="page-option prev-next" href="PREV LINK">Prev</a>
</li>
<li>
<a class="page-option prev-next" href="NEXT LINK">Next</a>
</li>
</ul>
</div>
</body>
</html>
Outputs:
"NEXT LINK"
"NEXT LINK"