I wrote a script that collects all URLs within a buffer that's read from a database, checks whether each page still exists, and uses HTTP::Tiny to delete the URL from the buffer if the page is unreachable or returns an invalid response.
The problem is that HTTP::Tiny delete leaves behind anchor tags like <a>text here</a> that are invalid. The links are still highlighted, but there's obviously no way to click them. Is this a deficiency in HTTP::Tiny delete, or am I using it wrong?
    my $html_buff = $ref->{'fulltext'}; # $ref is a hashref holding the current database row
    my $dom_buff = Mojo::DOM->new($html_buff);
    foreach my $ele ($dom_buff->find('a[href]')->each) {
        my $url = $ele->attr('href');
        my $response = HTTP::Tiny->new(default_headers => { Accept => '*/*' })->get($url);
        if ($response->{success}) {
            $success_fulltext_urls{$ref->{'id'}}{$url} = 1;
        } else {
            delete $ele->attr->{href};
            $html_buff = $dom_buff;
            $html_buff =~ s{<a>(.*?)</a>}{$1}sg;
            my $sql = "not described here";
            write_sql($dbh,$sql,$ref->{'id'});
        }
    }
Here is an example string, after it's been processed by the code above.
This week, perhaps the most interesting articles include "<a>Finding \r\n that Windows is superior to Linux is biased</a>," "<a href=\"http://www.example.com/content/view/118693\">How \r\n to set up DNS for Linux VPNs</a>," and "<a href=\"http://www.example.com/content/view/118664 \">Writing \r\n an Incident Handling and Recovery Plan</a>."
Note that the string "Finding \r\n that Windows is superior to Linux is biased" used to be a valid link with an href, but the delete function stripped the attribute out and left the anchor tags behind.
Is this the intended effect? Perhaps I should be using a different library or function within HTTP::Tiny?
You're misunderstanding what delete does. All your code does is remove the href attribute from that DOM element in your Mojo::DOM representation. It has nothing to do with HTTP::Tiny.
What you actually want to do is call ->strip on the <a> element, which removes it from the DOM, but keeps its content intact.
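To make the difference concrete, here is a minimal sketch that applies both operations to the same markup. The sample HTML and the variable names are made up for illustration and are not taken from the question's database.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup, used only to compare the two operations.
my $html = 'See <a href="http://example.com/dead">this page</a> for details.';

# Variant 1: delete only the href attribute. The <a> element itself stays.
my $dom_delete = Mojo::DOM->new($html);
delete $dom_delete->at('a')->attr->{href};
my $after_delete = "$dom_delete";

# Variant 2: strip the element. The tags go away, the text content remains.
my $dom_strip = Mojo::DOM->new($html);
$dom_strip->at('a')->strip;
my $after_strip = "$dom_strip";

print "delete href: $after_delete\n";
print "strip:       $after_strip\n";
```

The first variant should print the text still wrapped in a bare <a>...</a> pair, which is the behaviour described in the question, while the second should print plain text with no anchor tags at all.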
Since you are already using Mojo::DOM, you can just as well use Mojo::UserAgent. There is no need to pull in another UA module. You've already got the whole Mojolicious installed anyway.
You can use a HEAD request rather than a GET request to check if a resource is available. There is no need to download the whole thing, because the headers are sufficient.
Your code (without the DB part) can be reduced to this.
    use strict;
    use warnings;
    use Mojo::DOM;
    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new;

    # Read all of DATA as one string so the whole document is parsed
    my $dom = Mojo::DOM->new(do { local $/; <DATA> });

    foreach my $element ($dom->find('a[href]')->each) {
        # Drop the <a> tags but keep the link text when the HEAD request fails
        $element->strip
            unless $ua->head($element->attr('href'))->res->is_success;
    }

    print $dom;

    __DATA__
    This <a href="http://example.org">link works</a>.
    This <a href="http://httpstat.us/404">one does not</a>!
This outputs:
    This <a href="http://example.org">link works</a>.
    This one does not!
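If you want to plug this back into your database loop, one possible arrangement is to keep the link check separate from the DOM editing, so that each row is cleaned in one pass and written back once. The sketch below is only an illustration: strip_dead_links and the $is_alive callback are hypothetical names, and the stub checker stands in for a real request such as the $ua->head(...) call above.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Strip every <a href> whose URL fails the supplied check and return the
# resulting HTML. $is_alive is a code reference (hypothetical name) that
# takes a URL and returns true when the link should be kept.
sub strip_dead_links {
    my ($html, $is_alive) = @_;
    my $dom = Mojo::DOM->new($html);
    for my $link ($dom->find('a[href]')->each) {
        $link->strip unless $is_alive->($link->attr('href'));
    }
    return "$dom";
}

# Stub checker for illustration only; a real one could wrap
# $ua->head($url)->res->is_success instead of a lookup table.
my %dead = ('http://example.com/gone' => 1);
my $cleaned = strip_dead_links(
    'Read <a href="http://example.com/ok">this</a> and <a href="http://example.com/gone">that</a>.',
    sub { !$dead{ $_[0] } },
);
print "$cleaned\n";
```

With this shape there is no need for the s{<a>(.*?)</a>}{$1}sg cleanup, since no bare anchors are left behind, and the string returned for a row can be handed to your existing write_sql call after the loop has finished.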