
Using Perl to strip everything from a string except HTML Anchor Links


Using Perl, how can I use a regex to take a string that contains arbitrary HTML, including one HTML link with anchor text, like this:

  <a href="http://example.com" target="_blank">Whatever Example</a>

and leave ONLY that, getting rid of everything else? It should work no matter which attributes appear inside the <a> tag alongside href (title=, style=, or whatever), and it should keep the anchor text "Whatever Example" and the closing </a>.


Solution

  • You can take advantage of a stream parser such as HTML::TokeParser::Simple:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    use HTML::TokeParser::Simple;
    
    my $html = <<'EO_HTML';   # single-quoted heredoc: no variable interpolation
    
    Using Perl, how can I use a regex to take a string that has random HTML in it
    with one HTML link with anchor, like this:
    
       <a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>
    
           and it leave ONLY that and get rid of everything else? No matter what
       was inside the href attribute with the <a, like title=, or style=, or
       whatever. and it leave the anchor: "Whatever Example" and the </a>?
    EO_HTML
    
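    # Parse the in-memory string rather than a file.
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    
    # get_tag('a') skips ahead to the next <a ...> start tag; as_is returns
    # the tag exactly as written, attributes included. get_text('/a') then
    # collects the text up to the closing </a>, dropping nested tags such
    # as <i> and decoding entities.
    while (my $tag = $parser->get_tag('a')) {
        print $tag->as_is, $parser->get_text('/a'), "</a>\n";
    }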
    

    Output:

    $ ./whatever.pl
    <a href="http://example.com" target="_blank">Whatever Interesting Example</a>
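
Since the question asks specifically for a regex, here is a minimal sketch of that route using only core Perl. It is not part of the answer above. The helper name extract_anchors is hypothetical, and the pattern assumes that anchors are not nested, that no attribute value contains a literal ">", and that the markup is not hidden inside comments or scripts. Unlike the parser version, it keeps any nested tags inside the link untouched.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns every
# <a ...>...</a> element found in $html, exactly as written.
sub extract_anchors {
    my ($html) = @_;

    # <a            literal start of an anchor tag
    # (?:\s[^>]*)?  optional attributes; the \s keeps <abbr> etc. from matching
    # >.*?          tag end, then as little content as possible
    # </a\s*>       the closing tag
    # Flags: g = all matches, i = case-insensitive, s = let . match newlines.
    return $html =~ m{(<a(?:\s[^>]*)?>.*?</a\s*>)}gis;
}

my $html = 'Some <b>random</b> HTML '
         . '<a href="http://example.com" target="_blank">Whatever Example</a>'
         . ' and more text.';

print "$_\n" for extract_anchors($html);
```

On well-formed input like the sample string this prints the single link and nothing else. For HTML you do not control, the parser approach shown above is the safer choice, because a regex cannot reliably account for every way markup can be written.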