
Perl: Removing leftover anchor tags from a string


I wrote a script that used HTTP::Tiny to check all the URLs in an HTML file and delete them, but it left a bunch of incomplete anchor tags like <a>text here</a> throughout the database. These changes have already been made and the script doesn't need to be run again.

The script read from a database where this info is stored.

The goal now is to strip the <a> and </a> tags from each <a>text here</a> while keeping the "text here" text and everything else in the buffer.

I wrote another script that reads the database and includes the following regex:

$html_buffer =~ s/(.*?)<a>(.*?)<\/a>(.*?)/$1$2$3/g;

but it doesn't work and I don't know why. Here's an example buffer:

This week, perhaps the most interesting articles include &quot;<a>Finding \r\n  that Windows is superior to Linux is biased</a>,&quot; &quot;<a href=\"http://www.example.com/content/view/118693\">How \r\n  to set up DNS for Linux VPNs</a>,&quot; and &quot;<a href=\"http://www.example.com/content/view/118664
\">Writing \r\n  an Incident Handling and Recovery Plan</a>.&quot;

Of course I want the regex to operate throughout the string, but also match multiple occurrences within the string.

Is this the most effective way to do this? How can I be sure to not remove the </a> at the end of the string like in the above example?


Solution

  • In principle, it is always advisable to use a proper parser. An example is in the second part of this answer.


    That regex needs the /s modifier so that . matches linefeeds as well, and then it works. Without it, the pattern .*? stops at newlines. It's only by happenstance that the regex as written still manages to match some of the string, appearing to "work, but not right."

    The point is that the string, with its \n sequences, contains actual newlines if it is assigned double-quoted:

    my $html_buffer = " ... ";   # or:  = qq(...);
    

    because then \n is interpreted as a linefeed. If that string is given with single quotes,

    my $html_buffer = ' ... ';   # or:  = q(...);
    

    then there are no newlines in it, just the occasional characters \ and n one after another, and .*? works as intended.
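
    A small sketch of that difference, using a made-up two-line string (the variable names are for illustration only and are not taken from the question's script):

    ```perl
    use strict;
    use warnings;

    # Double-quoted: \n becomes a real newline character
    my $dq = "x <a>two\nlines</a> y";
    # Single-quoted: a literal backslash followed by the letter n
    my $sq = 'x <a>two\nlines</a> y';

    # Without /s the dot cannot cross the real newline, so nothing is replaced
    (my $dq_plain = $dq) =~ s{<a>(.*?)</a>}{$1}g;

    # With /s the dot matches the newline too, and the tags are removed
    (my $dq_s = $dq) =~ s{<a>(.*?)</a>}{$1}sg;

    # No real newline in the single-quoted string, so /s is not needed here
    (my $sq_plain = $sq) =~ s{<a>(.*?)</a>}{$1}g;

    print "unchanged without /s\n" if $dq_plain eq $dq;
    print "stripped with /s\n"     if $dq_s eq "x two\nlines y";
    print "stripped, no newline\n" if $sq_plain eq 'x two\nlines y';
    ```

    All three messages are printed. Since /s changes nothing when the string has no real newlines, keeping it on covers both cases.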

    Finally: the substitution operator affects only the part of the string that has been matched (or less, if the pattern is written so as to "drop" some of the match, for example with \K), so there is no reason for the leading and trailing (.*?). All that's needed is

    $html_buffer =~ s{<a>(.*?)</a>}{$1}sg;
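
    As a quick check of that claim, here is a sketch that runs both forms on a made-up sample string (the sample and the variable names are assumptions for illustration):

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample; the newline inside the first pair is a real one
    my $sample = "intro <a>one\ntwo</a> middle <a>three</a> outro";

    # The question's pattern, with /s added
    (my $long = $sample) =~ s/(.*?)<a>(.*?)<\/a>(.*?)/$1$2$3/sg;

    # The shorter pattern: text outside the match is never touched
    (my $short = $sample) =~ s{<a>(.*?)</a>}{$1}sg;

    print "same result\n" if $long eq $short;
    ```

    Both substitutions produce the same string, so the extra capture groups only add work.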
    

    For a little extra safety, allow optional whitespace inside the tags in these patterns, as in <\s*a\s*>, so

    $html_buffer =~ s{<\s*a\s*>(.*?)<\s*/a\s*>}{$1}sg;
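
    On the question of not removing the </a> that belongs to a real link: the pattern can only start at an attribute-less <a>, so an <a href="..."> and its closing tag are never part of a match. A sketch with a shortened buffer modeled on the question's example (the text is abbreviated and the variable names are assumptions):

    ```perl
    use strict;
    use warnings;

    # Shortened stand-in for the question's buffer, with real CR/LF inside the anchors
    my $html_buffer = qq(&quot;<a>Finding \r\n  that X</a>,&quot; and &quot;<a href="http://www.example.com/content/view/118693">How \r\n  to set up DNS</a>.&quot;);

    # Strip only the bare <a>...</a> pairs, keeping their text
    (my $cleaned = $html_buffer) =~ s{<\s*a\s*>(.*?)<\s*/a\s*>}{$1}sg;

    print "bare anchor removed\n" if index($cleaned, '<a>') == -1;
    print "real link intact\n"    if index($cleaned, '</a>.&quot;') != -1;
    ```

    One caveat: if a bare <a> has lost its own closing tag, the lazy match runs on to the next </a> in the buffer, which may belong to a real link. That is one more reason to prefer a parser when the data is not known to be this regular.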
    

    To note, I'd find it a far superior solution to fix that original script so that it doesn't leave unintended bits and pieces behind. I'd hazard a guess that this happened because regex was used on the HTML? It would be very surprising if any of the major libraries written to parse HTML/XML crippled (presumably valid) HTML like that.

    On the other hand, that can be done now, as well. If the remaining text and its not-links (<a> with no href attribute technically isn't a hyperlink) are exceedingly simple then it may be simpler to use a regex (fingers crossed) as asked; just this one time.

    In all other cases, here's a very basic take with Mojo::DOM

    use warnings;
    use strict;
    use feature 'say';
    
    use Mojo::DOM; 
    
    my $html = q(<p> a link: <a>no href</a>, not. <p> OK: <a href="#">hoho</a>);
    
    my $dom = Mojo::DOM->new( $html );    
    say $dom; 
    
    foreach my $link ($dom->find("a")->each) { 
    
        if (not defined $link->attr->{href}) {
            # Replace this node with its text
            $link->replace( $link->text );
        }
    
    }
    say $dom;  # object stringifies to just its HTML
    

    (or $link->strip; to remove the HTML element but leave its content, with the same result for this input)
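
    The two approaches differ once a bare anchor contains nested markup: text returns only the element's own text, while strip keeps all of its content. A sketch of the difference, assuming Mojolicious is installed; the input string and the use of the a:not([href]) selector are my own choices for illustration, not part of the code above:

    ```perl
    use strict;
    use warnings;
    use Mojo::DOM;

    # Made-up input: the bare anchor contains a nested <b> element
    my $html = q(<p><a>no <b>href</b></a> and <a href="#">ok</a></p>);

    # Replacing with ->text keeps only the anchor's own text, so <b>href</b> is lost
    my $by_text = Mojo::DOM->new($html);
    $by_text->find('a:not([href])')->each(sub { $_->replace($_->text) });

    # strip removes the <a> element itself but keeps everything inside it
    my $by_strip = Mojo::DOM->new($html);
    $by_strip->find('a:not([href])')->each(sub { $_->strip });

    print "$by_text\n";
    print "$by_strip\n";
    ```

    If the leftover anchors may wrap other elements, strip is probably the safer of the two.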

    This prints

    <p> a link: <a>no href</a>, not. </p><p> OK: <a href="#">hoho</a></p>
    <p> a link: no href, not. </p><p> OK: <a href="#">hoho</a></p>
    

    I've used a shorter made-up HTML string, but I checked with the question's example as well. There are other ways to do this with Mojo and it's fun (and useful) exploring them.

    For example, process only the links for which href attribute isn't defined, by filtering first

    $_->replace($_->text) for 
        $dom -> find("a")
             -> grep( sub { not defined $_->attr->{href} } )
             -> each;
    

    Or, process directly

    $dom -> find("a")
         -> each( sub { 
                $_->replace($_->text) if not defined $_->attr->{href} 
            });