Perl: HTTP::Tiny delete leaves broken anchor tags


I wrote a script that collects all URLs within a buffer that's read from a database, checks whether that page still exists, and uses HTTP::Tiny to delete the URL from the buffer if it is unreachable or returns invalid.

The problem is that HTTP::Tiny delete leaves behind invalid anchor tags like <a>text here</a>. The links are highlighted, but there's obviously no way to click them. Is this a deficiency with HTTP::Tiny delete or am I using it wrong?

my $html_buff = $ref->{'fulltext'}; # $ref is a pointer to the database
my $dom_buff = Mojo::DOM->new($html_buff);
foreach my $ele ($dom_buff->find('a[href]')->each) {
  my $url = $ele->attr('href');
  my $response = HTTP::Tiny->new(default_headers => { Accept => '*/*' })->get($url);
  if ($response->{success}) {
     $success_fulltext_urls{$ref->{'id'}}{$url} = 1;
  } else {
     delete $ele->attr->{href};
     $html_buff = $dom_buff;
     $html_buff =~ s{<a>(.*?)</a>}{$1}sg;
     my $sql      = "not described here";
     write_sql($dbh,$sql,$ref->{'id'});
  }
}

Here is an example string, after it's been processed by the code above.

This week, perhaps the most interesting articles include &quot;<a>Finding \r\n  that Windows is superior to Linux is biased</a>,&quot; &quot;<a href=\"http://www.example.com/content/view/118693\">How \r\n  to set up DNS for Linux VPNs</a>,&quot; and &quot;<a href=\"http://www.example.com/content/view/118664 \">Writing \r\n  an Incident Handling and Recovery Plan</a>.&quot;

Note the string "Finding \r\n that Windows is superior to Linux is biased" used to be a valid link with an href, but the delete function stripped all that out and left the anchor tags.

Is this the intended effect? Perhaps I should be using a different library or function within HTTP::Tiny?


Solution

  • You're misunderstanding what delete does. All your code does is remove the href attribute from that DOM element in your Mojo::DOM representation. It has nothing to do with HTTP::Tiny.

    What you actually want to do is call ->strip on the <a> element, which removes it from the DOM, but keeps its content intact.

    Since you are already using Mojo::DOM, you might as well use Mojo::UserAgent too. There is no need to pull in another UA module, because you already have all of Mojolicious installed.

    You can use a HEAD request rather than a GET request to check whether a resource is available. There is no need to download the whole thing; the headers are sufficient.

    Your code (without the DB part) can be reduced to this.

    use strict;
    use warnings;
    use Mojo::DOM;
    use Mojo::UserAgent;
    
    my $ua = Mojo::UserAgent->new;
    my $dom = Mojo::DOM->new(do { local $/; <DATA> });    # slurp the whole DATA section
    
    foreach my $element ($dom->find('a[href]')->each) {
        $element->strip
            unless $ua->head($element->attr('href'))->res->is_success;
    }
    
    print $dom;
    
    __DATA__
    This <a href="http://example.org">link works</a>.
    This <a href="http://httpstat.us/404">one does not</a>!
    

    This outputs:

    This <a href="http://example.org">link works</a>.
    This one does not!
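
To make the difference between the two approaches concrete, here is a minimal offline sketch that applies both to the same markup. The sample string and its example.com URLs are made up for illustration, and no HTTP request is made; the "dead" link is simply picked with an attribute selector.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup; the URLs are placeholders, nothing is fetched.
my $html = 'See <a href="http://example.com/dead">this article</a>'
         . ' and <a href="http://example.com/live">that one</a>.';

# Approach from the question: delete only the href attribute.
# The <a> element itself stays in the DOM, now without a target.
my $deleted = Mojo::DOM->new($html);
delete $deleted->at('a[href$="/dead"]')->attr->{href};

# Approach from the answer: strip the element but keep its content.
my $stripped = Mojo::DOM->new($html);
$stripped->at('a[href$="/dead"]')->strip;

print "delete: $deleted\n";
print "strip:  $stripped\n";
```

This should print the first string with a bare <a>this article</a> left in place, and the second with plain "this article" and no anchor around it. With ->strip there is nothing left to clean up, so the s{<a>(.*?)</a>}{$1}sg substitution from the question is no longer needed.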