
Timeout with HTML::TreeBuilder


I was wondering if there is an elegant way to implement a timeout with HTML::TreeBuilder.


My current implementation is as follows:

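use LWP::UserAgent;      # loaded up front so the local overrides below apply to it
use HTML::TreeBuilder;

use constant WEB_AGENT   => 'Mozilla/5.0';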
use constant WEB_TIMEOUT =>  10;
use constant url         => 'https://...';

my $tree;

eval {
  local $SIG{ALRM} = sub { die "Timeout\n"; };
  alarm(WEB_TIMEOUT);
  $tree = do {
    local $SIG{__WARN__} = sub { };
    local *LWP::UserAgent::_agent   = sub { WEB_AGENT   };
  # local *LWP::UserAgent::_timeout = sub { WEB_TIMEOUT };
    HTML::TreeBuilder->new_from_url(url);
  };
  alarm(0);
};
# check $@

Is it possible to avoid using alarm?


Solution

  • Mojolicious turns this inside out, which I think makes this much easier. I don't have to stitch together various things or work hard to subvert their internals. Here's an outline for how Mojo would do it:

    use Mojo::UserAgent;

    my $ua = Mojo::UserAgent->new(...);
    $ua->transactor->name( $user_agent_string );  # sets the User-Agent header
    $ua->request_timeout(5);                      # seconds allowed for the whole request
    
    my $tx = $ua->get($url);
    
    my $dom = $tx->res->dom;  # play with the HTML through its DOM representation
    

    I show many more examples in https://leanpub.com/mojo_web_clients/.

    If you wanted to still use HTML::TreeBuilder, I'd suggest that you make a minimal subclass that adds a method to return the internal LWP object so that you can affect it in the usual ways. Note that you can set the things you want through the public methods of LWP::UserAgent already.

    There's another possibility here, and it's probably the right answer. new_from_url merely fetches the page for you and then calls HTML::TreeBuilder->new for you. This means that the best solution to your problem is probably to fetch the content yourself, in any way that you like, and then call parse yourself on a plain HTML::TreeBuilder object.
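
A minimal sketch of that last approach, assuming LWP::UserAgent is still the fetcher. The helper names `tree_from_html` and `tree_from_url` are hypothetical, and the constants are copied from the question. The agent string and timeout go through LWP::UserAgent's public constructor options, so nothing internal is overridden and no alarm is needed:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

use constant WEB_AGENT   => 'Mozilla/5.0';
use constant WEB_TIMEOUT => 10;

# Hypothetical helper: parse an HTML string into a tree.
# The caller should call $tree->delete when finished with it.
sub tree_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);    # parse() followed by eof()
    return $tree;
}

# Hypothetical helper: fetch $url ourselves, then hand the body to the parser.
sub tree_from_url {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(
        agent   => WEB_AGENT,
        timeout => WEB_TIMEOUT,     # seconds without activity before LWP gives up
    );
    my $res = $ua->get($url);
    die 'Fetch failed: ' . $res->status_line . "\n" unless $res->is_success;
    return tree_from_html( $res->decoded_content );
}
```

Usage would look something like `my $tree = eval { tree_from_url(url) };` followed by a check of `$@`, much as in the question. One caveat: LWP's timeout is an inactivity timeout, so a server that keeps trickling data can still take longer than WEB_TIMEOUT seconds overall. If a hard upper bound on total time matters, the alarm wrapper (or the Mojo request_timeout shown above) may still be the better fit.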