
Mojo::DOM HTML extraction


I'm trying to extract quite a bit of data from a perfectly structured web page and struggling with Mojo::DOM methods. I would really appreciate it if anyone could point me in the right direction.

The truncated HTML with interesting data follows:

 <div class="post" data-story-id="3964117" data-visited="false">//extracting story-id
  <h2 class="post_title page_title"><a href="http://example.com/story/some_url" class="to-comments">header.</a></h2>
  //useless data and tags

<a href="http://example.com/story/some_url" class="b-story__show-all">
  <span>useless data</span>
</a>

<div class="post_tags">
  <ul>
    <li class="post_tag post_tag_strawberry hidden"><a href="http://example.com/search.php?n=32&r=3">&nbsp;</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag1</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag2/hot">tag2</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag3</a></li>
  </ul>
</div>

<div class="post_actions_box">

  <div class="post_rating_box">
    <ul data-story-id="3964117" data-vote="0" data-can-vote="true">
      <li><span class="post_rating post_rating_up control">&nbsp;</span></li>
      <li><span class="post_rating_count control label">1956</span></li> //1956 - interesting value
      <li><span class="post_rating post_rating_down control">&nbsp;</span></li>
    </ul>
  </div>

  <div class="post_more_box">
    <ul>
      <li>
        <span class="post_more control">&nbsp;</span>
      </li>
      <li>
        <a class="post_comments_count label to-comments" href="http://example.com/story/some_url#comments">132&nbsp;<i>&nbsp;</i></a>
      </li>
    </ul>
  </div>

</div>
</div>

What I have right now is

use strict;
use warnings;

use Data::Dumper;
use Mojo::DOM;


my $file = "index2.html";
local( $/, *FH ) ;
open( FH, $file ) or die "sudden flaming death\n";
my $text = <FH>;
my $dom = Mojo::DOM->new;
$dom->parse($text);
my $ids = $dom->find('div.post')
    ->each (sub {print $_->attr('data-story-id'), "\n";});
$dom->find('a.to-comments')->each (sub {print $_->text, "\n";});

This mess extracts data-story-id and the header text from the source (the same approach works for the href value), but all my other attempts fail.

3964117
Header
132

The span with class "post_rating_count control label" is not extracted. I can get the first href value by searching for a.to-comments and returning attr('href'), but for some reason it also returns the link at the end of the segment, the one with class="post_comments_count label to-comments". The same happens when extracting the header value.

In the end I am looking for an array of data structures with the following fields:

  • story-id (this one works)
  • href (matches more than needed)
  • header (matches more than needed)
  • list of tags as a string (no idea how to do that)

What is more, I feel it is possible to optimize the code and make it look a bit better, but my kung-fu is not so strong.


Solution

  • Your HTML is malformed, as I said in my comment. I've guessed where the missing <div> might go, but I'm probably wrong. I've assumed the last </div> in the data corresponds to the first <div>, so that the whole block constitutes a single post.

    The main problem you have is trying to do everything inside an each method call on your Mojo::Collection objects. It's far easier to use Perl to iterate over each collection, like this

    use strict;
    use warnings;
    
    use Mojo::DOM;
    
    use constant HTML_FILE => 'index2.html';
    
    my $html = do {
        open my $fh, '<', HTML_FILE or die $!;
        local $/;
        <$fh>;
    };
    
    my $dom = Mojo::DOM->new($html);
    
    for my $post ( $dom->find('div.post')->each ) {
    
        printf "Post ID:     %s\n", $post->attr('data-story-id');
    
        my $anchor = $post->at('h2.post_title > a');
        printf "Post href:   %s\n", $anchor->attr('href');
        printf "Post header: %s\n", $anchor->text;
    
        my @tags = $post->find('li.post_tag > a')->map('text')->each;
    
        printf "Tags:        %s\n", join ', ', @tags;
    
        print "\n";
    }
    

    output

    Post ID:     3964117
    Post href:   http://example.com/story/some_url
    Post header: Header
    Tags:        some_value, tag1, tag2, tag3
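
The remaining pieces of the question can be sketched the same way. A class selector such as a.to-comments matches every anchor whose class list contains to-comments, which is why the comments link at the end of the post shows up too. An element with several classes, like the rating span, can be selected by any one of its class names. The example below is a minimal sketch, not a definitive implementation: the HTML is a cut-down, well-formed copy of one post (the real page may nest things differently), and the names $loose_matches and @posts as well as the hash keys are made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Cut-down, well-formed copy of one post from the question (an assumption:
# the real page may nest these elements differently).
my $html = q{
<div class="post" data-story-id="3964117">
  <h2 class="post_title page_title"><a href="http://example.com/story/some_url" class="to-comments">Header</a></h2>
  <div class="post_tags"><ul>
    <li class="post_tag"><a href="http://example.com/tag/tag1/hot">tag1</a></li>
    <li class="post_tag"><a href="http://example.com/tag/tag2/hot">tag2</a></li>
  </ul></div>
  <div class="post_rating_box"><ul>
    <li><span class="post_rating_count control label">1956</span></li>
  </ul></div>
  <a class="post_comments_count label to-comments" href="http://example.com/story/some_url#comments">132&nbsp;<i>&nbsp;</i></a>
</div>
};

my $dom = Mojo::DOM->new($html);

# Both anchors carry the "to-comments" class, so the bare selector finds both
my $loose_matches = $dom->find('a.to-comments')->size;

# Collect one hash per post (hash keys are illustrative, not from the question)
my @posts;
for my $post ( $dom->find('div.post')->each ) {

    # Scope each lookup to the post and to a distinguishing parent or class
    my $anchor   = $post->at('h2.post_title > a');
    my $rating   = $post->at('span.post_rating_count');
    my $comments = $post->at('a.post_comments_count');

    # Keep only the digits; the anchor text also holds a non-breaking space
    my ($comment_count) = $comments ? $comments->text =~ /(\d+)/ : ();

    push @posts, {
        id       => $post->attr('data-story-id'),
        href     => $anchor ? $anchor->attr('href') : undef,
        header   => $anchor ? $anchor->text         : undef,
        tags     => join( ', ', $post->find('li.post_tag > a')->map('text')->each ),
        rating   => $rating ? $rating->text : undef,
        comments => $comment_count,
    };
}

# Print one summary line per collected post
printf "%s: %s [%s] rating=%s comments=%s\n",
    @{$_}{qw(id header tags rating comments)} for @posts;
```

Because at returns undef when nothing matches, each lookup is guarded before a method is called on it. That matters on a real page, where a post may lack a rating or a comments link.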