Perl: HTTP::Tiny Connection Stalls, get() never returns


I'm using perl-HTTP-Tiny-0.080 on Fedora 35 to check the HTTP status code of a series of URLs. My script runs fine until it reaches one particular URL, a PDF at sophos.com. There the script stalls: the get() or head() call simply never returns. I've also tried setting a timeout, and it appears to be ignored.

use HTTP::Tiny;  
use Net::FTP::Tiny qw(ftp_get);
my $url = "https://news.sophos.com/wp-content/uploads/2020/02/CloudSnooper_report.pdf";
my $response = HTTP::Tiny->new(timeout => 2)->get($url);
print "status: $response->{status} $url\n";

The print is never reached. Fetching the URL manually with wget succeeds, but setting the agent to something other than the "HTTP/Tiny" default does not help:

my $response = HTTP::Tiny->new(agent => "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")->get($url);

This code is part of a larger script that I'm using to check a series of URLs from a buffer to determine whether they're 404s and should be removed, or are still working links.

I'm unsure what further info I can provide.


Solution

  • The URL you have for news.sophos.com redirects to another URL at www.sophos.com. That server sits behind the Akamai CDN:

    $ dig www.sophos.com
    ...
    www.sophos.com.         169     IN      CNAME   www.sophos.com.edgekey.net.
    www.sophos.com.edgekey.net. 469 IN      CNAME   e6203.b.akamaiedge.net.
    e6203.b.akamaiedge.net. 300     IN      A       23.60.192.131
    

    Akamai's bot protection can behave oddly if a request does not look like a typical browser request. It may fail with status code 403, or it may simply hang as you are experiencing, i.e. tarpit the client. See also "Requests SSL connection timeout" and "Strange CURL issue with a particular website SSL certificate", as well as "Why does Akamai edge services sometime just not send any response, leaving the connection to timeout", which describes a problem similar to the one you have with www.sophos.com.

    In this specific case simply adding an Accept header to the request worked for me:

    my $response = HTTP::Tiny->new(default_headers => { Accept => '*/*' })->get($url);
    

    Note that this workaround might no longer work in the future when Akamai adjusts its bot detection.
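
For the link-checking use case described in the question, the Accept header workaround could be folded into a small helper. This is only a minimal sketch, not the asker's script: the names `classify_response` and `check_url` are hypothetical, the 10-second timeout is an arbitrary choice, and treating only 404 as "dead" is an assumption about what such a checker wants. HTTP::Tiny reports errors during a request (connection failures, timeouts) as pseudo-status 599, so those are kept apart from a genuine 404.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: map an HTTP::Tiny response hash to a verdict.
sub classify_response {
    my ($res) = @_;
    my $status = $res->{status} // 599;
    return 'alive' if $res->{success};    # final status was 2xx
    return 'dead'  if $status == 404;     # candidate for removal
    return 'unknown';                     # 403, 5xx, or 599 (error during the request)
}

# Hypothetical helper: fetch one URL, sending the Accept header workaround.
sub check_url {
    my ($url) = @_;
    my $http = HTTP::Tiny->new(
        timeout         => 10,                       # assumed value
        default_headers => { Accept => '*/*' },
    );
    my $res = $http->get($url);
    return (classify_response($res), $res->{status});
}
```

Calling `my ($verdict, $status) = check_url($url);` for each URL in the buffer would then leave "unknown" links for manual review instead of deleting them because of a 403 or a stalled connection.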

    "I've also tried to set a timeout and it appears to be ignored."

    This is a known issue, which is especially noticeable when TLS 1.3 is used, as is the case here. See the issue "Sometimes, timeout can fail to fire" (#146).
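
Until the timeout behaves reliably, one common Perl fallback is to enforce a wall-clock deadline with `alarm`. The following is a minimal sketch under assumptions, not something HTTP::Tiny provides: `with_deadline` is a hypothetical helper name, it relies on `alarm` being available on the platform (typical on Linux), and signal delivery inside a stalled TLS read is not guaranteed in every setup.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code, giving up after $seconds of wall-clock time.
# Returns the code's scalar result, or undef if the deadline fired.
sub with_deadline {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "deadline\n" };
        alarm $seconds;
        $result = $code->();
        alarm 0;
        1;
    };
    alarm 0;                          # never leave an alarm pending
    return $result if $ok;
    return if $@ eq "deadline\n";     # our own timeout
    die $@;                           # rethrow anything else
}
```

It could be used as `my $res = with_deadline(15, sub { HTTP::Tiny->new(default_headers => { Accept => '*/*' })->get($url) });`. Because HTTP::Tiny turns errors raised during a request into a response with status 599, the deadline may surface as such a 599 response rather than as undef, so a caller should probably treat both as "timed out".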