Web::Scraper cannot find <link> or <meta> elements in the <body> of an HTML document


I've been staring at this for an hour now and I'm throwing in the towel.

I am attempting to scrape some data from a web page. Here's a snippet with some of the data I'm trying to extract:

<span itemprop="thumbnail" itemscope itemtype="http://schema.org/ImageObject">
  <link itemprop="url" href="http://blahblah.org/video/thumbnail_23432230.jpg">
  <meta itemprop="width" content="1280">
  <meta itemprop="height" content="720">
</span>

I want to grab the value of the href attribute from the <link> tag with the Web::Scraper module. Here's the relevant Perl code:

my $div = scraper {
  process 'span[itemprop="thumbnail"] > link', url => '@href';
};
my $res = $div->scrape( $html );
my $url = $res->{url};

No matter what I try, $url is undefined. I'm using version 0.36 of the Web::Scraper module.


Solution

  • This is caused by a bug in HTML::TreeBuilder::XPath, which Web::Scraper uses to parse the document. It has a naive understanding of <link> and <meta> elements and insists that they belong only in the <head> element, even when they have an itemprop attribute.

    The way elements are treated is based on the hashes in HTML::Tagset, and a fix of sorts can be effected by hacking this data.

    If you add this to the top of your program

    use HTML::Tagset;
    
    for (qw/ link meta /) {
        $HTML::Tagset::isHeadElement{$_}       = 0;
        $HTML::Tagset::isHeadOrBodyElement{$_} = 1;
    }
    

    then it "fixes" the specific situation in your question, but of course a proper solution should take account of the itemprop attributes as well as the tags.
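
For reference, the workaround above can be put together with the scraper from the question as one script. This is only a sketch: the helper name scrape_thumbnail_url is made up for illustration, and instead of changing HTML::Tagset for the whole program it uses local to limit the change to a single call, which is a variation on the snippet above rather than part of it. It assumes Web::Scraper is using its default HTML::TreeBuilder::XPath parser.

```perl
use strict;
use warnings;

use HTML::Tagset;
use Web::Scraper;

# Sample markup copied from the question.
my $html = q{
<span itemprop="thumbnail" itemscope itemtype="http://schema.org/ImageObject">
  <link itemprop="url" href="http://blahblah.org/video/thumbnail_23432230.jpg">
  <meta itemprop="width" content="1280">
  <meta itemprop="height" content="720">
</span>
};

# Hypothetical helper (not part of Web::Scraper): scrape the thumbnail URL
# while <link> and <meta> are temporarily allowed inside the <body>.
sub scrape_thumbnail_url {
    my ($markup) = @_;

    # Same change as the workaround above, but `local` restores the
    # original HTML::Tagset entries as soon as this sub returns.
    local @HTML::Tagset::isHeadElement{qw( link meta )}       = (0, 0);
    local @HTML::Tagset::isHeadOrBodyElement{qw( link meta )} = (1, 1);

    my $thumb = scraper {
        process 'span[itemprop="thumbnail"] > link', url => '@href';
    };

    # The parse happens inside scrape(), so it sees the localized values.
    return $thumb->scrape($markup)->{url};
}

my $url = scrape_thumbnail_url($html);
print defined $url ? "$url\n" : "no thumbnail url found\n";
```

Keeping the change inside the sub means other code in the same program that parses ordinary pages still gets the stock HTML::Tagset behaviour. Like the original workaround, this looks only at tag names and does not check for an itemprop attribute.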