perl html-entities mojolicious movabletype

How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?


I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.

I'm writing those phrases out to a file so that they can be translated into other languages. The relevant code is as follows:

        $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

        print FILE "\n\t### Body\n\n";        

        for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {

            print_phrase($phrase); # utility function to write out the phrase to a file

        }

When Mojo::DOM encountered embedded HTML entities (such as &trade; and &nbsp;) it converted those entities into the characters they represent, rather than passing them along as written. I wanted the entities to be passed through as written.

I recognized that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."

If this is the case, I have to either try to figure out how to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or I must use some other technique to preserve the encoded HTML entities.
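As a minimal sketch of the "test before decoding" option described above: Mojo::Util::decode returns undef when the input cannot be decoded, so one possible approach is to fall back to the original string in that case. The helper name decode_or_keep is hypothetical and not part of Mojolicious, and the fallback policy is an assumption, not an established recipe.

```perl
use strict;
use warnings;
use Mojo::Util qw(decode encode);

# Hypothetical helper: decode UTF-8 bytes if possible, otherwise
# return the input unchanged (assumes it already holds Perl characters).
sub decode_or_keep {
    my ($text) = @_;
    my $decoded = decode('UTF-8', $text);   # undef if decoding fails
    return defined $decoded ? $decoded : $text;
}

my $bytes = encode('UTF-8', "\x{2122}");    # UTF-8 bytes for the trademark sign
my $chars = "\x{2122}";                     # already a Perl character string

my $from_bytes = decode_or_keep($bytes);    # decoded to one character
my $from_chars = decode_or_keep($chars);    # decode fails, input is kept
```

Note that this cannot tell valid UTF-8 bytes apart from a character string that happens to consist of the same code points, so it is a heuristic at best. It also does nothing to preserve entities; it only avoids the undef return.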

How do I most reliably preserve encoded HTML Entities when processing HTML documents with Mojo::DOM?


Solution

  • Through testing, my colleagues and I determined that Mojo::DOM->new() decodes HTML entities (anything introduced by an ampersand, &) automatically while parsing, which makes it impossible to preserve entities as written. To get around this, we added the following subroutine, which encodes every ampersand a second time before the text reaches the parser:

    sub encode_amp {
        my ($text) = @_;
    
        ##########
        #
        # We discovered that we need to encode ampersand
        # characters being passed into Mojo::DOM->new() to avoid HTML entities being decoded
        # automatically by Mojo::Util::html_unescape().
        #
        # What we're doing is calling $dom = Mojo::DOM->new(encode_amp($string)), which double encodes
        # any incoming ampersand: & becomes &amp; and &trade; becomes &amp;trade;.
        #
        #
        ##########   
    
        $text .= '';           # Suppress uninitialized value warnings
        $text =~ s!&!&amp;!g;  # HTML encode ampersand characters
        return $text;
    }
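
To illustrate why the extra encoding step helps, here is a small self-contained sketch (the sample HTML string is made up for illustration). It parses the same fragment twice: once as is, and once with every ampersand replaced by &amp; first.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up sample input containing two named entities
my $html = '<p>Acme&trade; widgets&nbsp;here</p>';

# Parsed as is: the parser decodes the entities into characters
my $plain = Mojo::DOM->new($html)->at('p')->text;

# Ampersands encoded first: the parser decodes &amp; back to &,
# which leaves the original entity text in place
(my $escaped = $html) =~ s/&/&amp;/g;
my $kept = Mojo::DOM->new($escaped)->at('p')->text;

print "$kept\n";   # prints: Acme&trade; widgets&nbsp;here
```

The parser unescapes each entity only once, so &amp;trade; comes out as the literal text &trade; instead of the trademark character.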
    

    Later in the script we pass $page->text through encode_amp() as we instantiate a new Mojo::DOM object.

        $dom = Mojo::DOM->new(encode_amp($page->text));
    
    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    # Note that "h2 b" is an important tag combination for capturing major headings on pages
    # in this theme. The tags "span" and "a" are also.
    #
    # We added caption and th to support tables.
    #
    # We added li and li a to support ol (ordered lists) and ul (unordered lists).
    #
    # We got the complicated map('descendant_nodes') logic from @Grinnz on StackOverflow, see:
    # https://stackoverflow.com/questions/55130871/how-do-i-most-reliably-preserve-html-entities-when-processing-html-documents-wit#comment97006305_55131737
    #
    #
    # Original set of selectors in $dom->find() below is as follows:
    #   'h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a'
    #
    ##########
    
        print FILE "\n\t### Body\n\n";        
    
        for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->
            map('descendant_nodes')->map('each')->grep(sub { $_->type eq 'text' })->map('content')->uniq->each ) {           
    
            print_phrase($phrase);
    
        }
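
For readers unfamiliar with the chained calls above, the following sketch runs the same pipeline on a tiny made-up fragment, assuming a Mojolicious version that provides descendant_nodes and uniq. Each matched element is expanded into its descendant nodes, the nested collections are flattened with map('each'), only text nodes are kept, and duplicates (text reached through both a parent and a child selector) are dropped.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up fragment: the <strong> text is reachable via both selectors
my $dom = Mojo::DOM->new('<p>Hello <strong>bold</strong> world</p>');

my @phrases = $dom->find('p, p strong')
    ->map('descendant_nodes')               # one collection of nodes per match
    ->map('each')                           # flatten into a single list of nodes
    ->grep(sub { $_->type eq 'text' })      # keep text nodes only
    ->map('content')                        # node objects to strings
    ->uniq                                  # drop the repeated "bold"
    ->each;

print scalar(@phrases), " phrases\n";
```

Without uniq, "bold" would appear twice: once as a descendant of the paragraph and once as a descendant of the strong element.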
    

    The code block above incorporates previous suggestions from @Grinnz as seen in the comments in this question. Thanks also to @Robert for his answer, which had a good observation about how Mojo::DOM works.

    This code definitely works for my application.