I've been staring at this for an hour now and I'm throwing in the towel.
I am attempting to scrape some data from a web page. Here's a snippet with some of the data I'm trying to extract:
<span itemprop="thumbnail" itemscope itemtype="http://schema.org/ImageObject">
<link itemprop="url" href="http://blahblah.org/video/thumbnail_23432230.jpg">
<meta itemprop="width" content="1280">
<meta itemprop="height" content="720">
</span>
I want to grab the value of the href attribute from the <link> tag with the Web::Scraper module. Here's the relevant Perl code:
my $div = scraper {
process 'span[itemprop="thumbnail"] > link', url => '@href';
};
my $res = $div->scrape( $html );
$url = $res->{url};
No matter what I try, $url ends up undefined. I'm using version 0.36 of the Web::Scraper module.
This is because of a bug in HTML::TreeBuilder::XPath. It has a naive understanding of <link> and <meta> elements, insisting that they belong only in the <head> element, even if they have an itemprop attribute.
The way elements are treated is based on the hashes in HTML::Tagset, and a fix of sorts can be effected by hacking this data.
If you add this to the top of your program
use HTML::Tagset;
for (qw/ link meta /) {
$HTML::Tagset::isHeadElement{$_} = 0;
$HTML::Tagset::isHeadOrBodyElement{$_} = 1;
}
then it "fixes" the specific situation in your question, but of course a proper solution should take account of the itemprop attributes as well as the tags.
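For reference, here is a minimal sketch of how the workaround might sit together with the scraper from the question. The heredoc in $html (including the <html> and <body> wrapper) and the variable name $thumb are assumptions made for illustration; only the HTML::Tagset loop and the process rule come from the posts above.

```perl
use strict;
use warnings;

use HTML::Tagset;
use Web::Scraper;

# Workaround: tell the parser that <link> and <meta> may appear in the body.
# This changes global data, so it affects every parse in the program.
for (qw/ link meta /) {
    $HTML::Tagset::isHeadElement{$_}       = 0;
    $HTML::Tagset::isHeadOrBodyElement{$_} = 1;
}

# Assumed sample input: the snippet from the question wrapped in a minimal page.
my $html = <<'END_HTML';
<html><body>
<span itemprop="thumbnail" itemscope itemtype="http://schema.org/ImageObject">
<link itemprop="url" href="http://blahblah.org/video/thumbnail_23432230.jpg">
<meta itemprop="width" content="1280">
<meta itemprop="height" content="720">
</span>
</body></html>
END_HTML

# Same selector as in the question.
my $thumb = scraper {
    process 'span[itemprop="thumbnail"] > link', url => '@href';
};

my $res = $thumb->scrape($html);
my $url = $res->{url};

print defined $url ? "url: $url\n" : "url is still undefined\n";
```

If the hack takes effect, the <link> element should stay inside the <span> and the selector should match. Because the change to the HTML::Tagset hashes is global, it is best kept to small scripts like this one.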