
Using Mojo::DOM to extract untagged text after heading


I'm trying to extract some untagged text from an HTML file using Mojo::DOM (I'm new at this). In particular, I want the description text that comes after the H2 heading (there are other headings in the file).

<h2>Description</h2>This text is the description<div class="footer">[<a href="/contrib/rev/1597/2795/">Edit description</a>

I've been able to find the heading, but I don't know how to access the text after it, since I have no tag to jump to...

my $dom = Mojo::DOM->new( $htmlfile );
my $desc = $dom
    ->find('h2')
    ->grep(sub { $_->all_text =~ /Description/ })
    ->first;

Can anyone recommend a way to grab the "This text is the description" string?


Solution

  • One can go through all nodes, which also catches those that aren't inside an HTML element (tag). Then use the fact that the node you need is the one that follows the h2 tag.

    More precisely, it follows the text-node which is the child of the (identifiable) h2 tag-node.

    use warnings;
    use strict;
    use feature 'say';
    
    use Mojo::DOM;
    
    my $html = q(<h2>Description</h2> This text is the description <p>More...</p>);
    
    my $dom = Mojo::DOM->new($html);
    
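    # Flag: set once the text node inside an h2 has been seen
    my $is_next = 0;

    # descendant_nodes yields every node (tags and loose text) in document order
    foreach my $node ($dom->descendant_nodes->each) {
        my $par = $node->parent;
        if ($node->type eq 'text' and $par->type eq 'tag' and $par->tag eq 'h2') {
            $is_next = 1;
        }
        elsif ($is_next) {
            # The first node after the h2's own text: the loose text we want
            say $node;       #-->   This text is the description
            $is_next = 0;
        }
    }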
    

    Further criteria for selecting which h2 nodes are of interest can be added (unless all of them really are wanted) by inspecting either the preceding text-node (the text of the h2 tag) or its parent (the tag itself).

    The node itself should likely be checked as well, for example to see whether it's indeed just loose text and not actually a next tag.

    I've tested with far more complex HTML; the above is a near-minimal testable markup.


    In this simple example just $dom->text catches the needed text. However, that won't be the case in more complex fragments where the sought text doesn't come after the very first element.
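
The two refinements mentioned above (selecting a particular h2 by its text, and checking that the following node really is loose text) can also be sketched with Mojo::DOM's `next_node`, which returns the next sibling node of any type. This is a different technique from the loop in the answer, not a rewrite of it. The helper name `text_after_heading` and the sample markup are made up for illustration, and the sketch assumes the wanted text is the immediate sibling of the heading.

```perl
use strict;
use warnings;
use feature 'say';

use Mojo::DOM;

# Hypothetical helper (not part of Mojo::DOM): returns the loose text that
# directly follows the first h2 whose text matches $pattern, or undef.
sub text_after_heading {
    my ($html, $pattern) = @_;

    my $dom = Mojo::DOM->new($html);

    # Narrow down to the heading of interest by its text
    my $h2 = $dom->find('h2')->first(sub { $_->all_text =~ $pattern });
    return unless $h2;

    # next_node gives the following sibling, whether it is a tag or text
    my $next = $h2->next_node;
    return unless $next and $next->type eq 'text';

    # Trim surrounding whitespace; treat whitespace-only text as "nothing found"
    my $text = $next->content;
    $text =~ s/^\s+|\s+$//g;
    return length $text ? $text : undef;
}

# Sample markup modeled on the question, with an extra heading in front
my $html = q(<h2>Other</h2><p>Skip me</p>)
         . q(<h2>Description</h2>This text is the description)
         . q(<div class="footer">[<a href="#">Edit</a>]</div>);

say text_after_heading($html, qr/Description/) // '(no loose text found)';
```

If the node after the heading is a tag rather than text, the helper returns undef instead of guessing, so the caller can decide how to handle that case.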