Unable to Extract Link with Mojolicious


I am trying to extract the link to the next page from a search results page using Mojo::DOM. However, instead of Mojo::DOM elements, I get a string after calling ->find() on an existing element.

I have:

my $pagination_elements = $dom->find("div[class*=\"pagination-block\"]");
my $page_counter_text = $pagination_elements->find("div[class=\"page-of-pages\"]")->text();

$page_counter_text =~ /^Page (\d+) of (\d+)$/;
my $current_page = int($1);
my $last_page = int($2);

my $prev_next_elements = $pagination_elements->find("a[class*=\"prev-next\"]");
my $next_page_link = $prev_next_elements->last();
my $next_page_url = $next_page_link->attr("href");

On each page, there may be 2 link tags with a class of prev-next. Instead of getting the link for the last element, what I get is a string that contains the href for both of the tags (if both are available on the page).

Now, if instead of this I do:

my $next_page_link = $dom->find("div[class*=\"pagination-block\"] > ul > li > a[class*=\"prev-next\"]")->last();

my $next_page_url_rel = $next_page_link->attr("href");

I get the required link.

My question is: why does the second version work and not the first? Why do I have to start from the root DOM element to get a list of elements, and why does starting from a child of the root return a string containing all the link tags instead of just the one I want?

Edit: An example of the HTML I am parsing is:

<div class="pagination-block clearfix">
  <div class="page-of-pages">
    Page 2 of 100
  </div>

  <ul class="pagination-links">
    <li>
      .
      .
      .
    </li>

    <li>
      <a class="page-option prev-next" href="PREV LINK">Prev</a>
    </li>

    <li>
      <a class="page-option prev-next" href="NEXT LINK">Next</a>
    </li>
  </ul>
</div>

Solution

  • If you used Data::Dump (or some equivalent module) instead of print, you would get a clue as to what's going on:

    use Data::Dump;
    dd $next_page_url;
    dd $next_page_url_rel;
    

    Outputs:

    bless(["PREV LINK", "NEXT LINK"], "Mojo::Collection")
    "NEXT LINK"
    

    As you can see, your first variable actually holds a collection, and not a string.

    The problem arises because Mojo::DOM->find returns a Mojo::Collection, not a single element:

    my $pagination_elements = $dom->find('div[class*="pagination-block"]');
    

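    Calling find on that collection runs find on each element in it and returns a nested collection (a collection of collections). Calling last on the result therefore gives you an inner collection of links rather than a single link, and attr on that inner collection yields a collection of href values, which is what you saw printed as one string.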
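
    To see the difference between a collection and a single element for yourself, a minimal sketch like the following prints the class of each return value. The one-line HTML string is a cut-down stand-in for the page in the question (an assumption for illustration, not the real markup).

```perl
use strict;
use warnings;
use Mojo::DOM;

# Cut-down stand-in for the pagination markup (assumption, for illustration)
my $dom = Mojo::DOM->new(
    '<div class="pagination-block"><a class="prev-next" href="NEXT LINK">Next</a></div>'
);

# find always returns a Mojo::Collection, even for a single match
my $found = $dom->find('div[class*="pagination-block"]');
print ref($found), "\n";     # Mojo::Collection
print $found->size, "\n";    # 1

# first pulls one Mojo::DOM element out of the collection
my $element = $found->first;
print ref($element), "\n";   # Mojo::DOM
```

    Checking ref on an intermediate value like this is a quick way to tell whether the next method call will act on one element or on a whole collection.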

    Here are three different ways to fix your first attempt at finding the link URL:

    1. Use the Mojo::DOM->at method to find the first element in the DOM structure matching the CSS selector.

      my $pagination_elements = $dom->at('div[class*="pagination-block"]');
      
    2. Use Mojo::Collection->first or ->last to isolate a specific element in the collection before the subsequent find.

      my $pagination_elements
          = $dom->find('div[class*="pagination-block"]')->last();
      
    3. Use Mojo::Collection->flatten to flatten the nested collections created by your subsequent find into a new collection with all elements:

      my $pagination_elements = $dom->find('div[class*="pagination-block"]');
      my $prev_next_elements
          = $pagination_elements->find('a[class*="prev-next"]')->flatten();
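
    A note on Fix 3: it relies on Mojo::Collection forwarding unknown method calls such as find to its elements. As far as I know, more recent Mojolicious releases no longer do that, so calling find directly on a collection fails there. Assuming such a release, a sketch of the same idea using map and flatten (both documented Mojo::Collection methods) might look like this; the inline HTML is a shortened copy of the example from the question.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened copy of the pagination markup from the question
my $dom = Mojo::DOM->new(<<'HTML');
<div class="pagination-block clearfix">
  <ul class="pagination-links">
    <li><a class="page-option prev-next" href="PREV LINK">Prev</a></li>
    <li><a class="page-option prev-next" href="NEXT LINK">Next</a></li>
  </ul>
</div>
HTML

# Run find on every pagination block explicitly, then merge the
# resulting nested collections into one flat collection of links
my $prev_next_elements = $dom->find('div[class*="pagination-block"]')
    ->map(find => 'a[class*="prev-next"]')
    ->flatten;

# last returns undef for an empty collection, so guard before calling attr
my $next_page_link = $prev_next_elements->last;
my $next_page_url  = $next_page_link ? $next_page_link->attr('href') : undef;

print "$next_page_url\n" if defined $next_page_url;
```

    Fixes 1 and 2 call find on a single Mojo::DOM element, so they should not be affected by this difference.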
      

    All of these methods will make your script work as you intended:

    use strict;
    use warnings;
    
    use Mojo::DOM;
    use Data::Dump;
    
    my $dom = Mojo::DOM->new(do { local $/; <DATA> });
    
    # Fix 1
    my $pagination_elements = $dom->at('div[class*="pagination-block"]');
    
    # Fix 2
    #my $pagination_elements
    #    = $dom->find('div[class*="pagination-block"]')->last();
    
    # Fix 3 (if you use this one, also comment out the "my $prev_next_elements" line below)
    #my $pagination_elements = $dom->find('div[class*="pagination-block"]');
    #my $prev_next_elements
    #    = $pagination_elements->find('a[class*="prev-next"]')->flatten();
    
    my $prev_next_elements = $pagination_elements->find('a[class*="prev-next"]');
    my $next_page_link     = $prev_next_elements->last();
    my $next_page_url      = $next_page_link->attr("href");
    
    dd $next_page_url;
    
    $next_page_link = $dom->find('div[class*="pagination-block"] > ul > li > a[class*="prev-next"]')->last();
    my $next_page_url_rel = $next_page_link->attr("href");
    
    dd $next_page_url_rel;
    
    __DATA__
    <html>
    <head>
    <title>Paging Example</title>
    </head>
    <body>
        <div class="pagination-block clearfix">
          <div class="page-of-pages">
            Page 2 of 100
          </div>
    
          <ul class="pagination-links">
            <li>
              .
              .
              .
            </li>
    
            <li>
              <a class="page-option prev-next" href="PREV LINK">Prev</a>
            </li>
    
            <li>
              <a class="page-option prev-next" href="NEXT LINK">Next</a>
            </li>
          </ul>
        </div>
    </body>
    </html>
    

    Outputs:

    "NEXT LINK"
    "NEXT LINK"