I am trying to parse screen-scraped data using Zend_Dom_Query, but I am struggling to apply it properly to my case, and all the other answers I have seen on SO make assumptions that quite frankly scare me with their naiveté.
A typical example is "How to Pass Array from Zend Dom Query Results to table", where pairs of data points are extracted from the document's body through separate calls to the query() method.
$year = $dom->query('.secondaryInfo');
$rating = $dom->query('.ratingColumn');
The underlying assumptions are that an equal number of $year and $rating results exist AND that they are correctly aligned with each other within the document. If either of those assumptions is wrong, then the extracted data is less than worthless; in fact it becomes all lies.
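To make the failure mode concrete, here is a minimal sketch using PHP's stock DOMDocument and DOMXPath rather than Zend_Dom_Query. The markup and values are invented for illustration. One row has no rating, so the two independently queried lists differ in length, and pairing them by position attaches a rating to the wrong year.

```php
<?php
// Invented markup: the second row has no rating cell.
$html = '<table>'
      . '<tr><td class="secondaryInfo">(1994)</td><td class="ratingColumn">9.2</td></tr>'
      . '<tr><td class="secondaryInfo">(1972)</td></tr>'
      . '<tr><td class="secondaryInfo">(2008)</td><td class="ratingColumn">9.0</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Two separate document-wide queries, as in the linked example.
$years   = $xpath->query('//td[@class="secondaryInfo"]');
$ratings = $xpath->query('//td[@class="ratingColumn"]');

// Pair the results by position, which is only valid if both lists line up.
$pairs = [];
for ($i = 0; $i < $years->length; $i++) {
    $rating  = $ratings->item($i); // null once the shorter list runs out
    $pairs[] = [$years->item($i)->textContent, $rating ? $rating->textContent : null];
}
```

Here the 1972 row ends up paired with the rating that belongs to the 2008 row, and nothing in the result signals that anything went wrong.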
In my case I am trying to extract multiple chunks of data from a site, where each chunk is nominally of the form:
<p class="main" atrb1="value1">
    <a href="#1" >href text 1</a>
    <span class="sub1">
        <span class="span1"></span>
        <span class="sub2">
            <span class="span2">data span2</span>
            <a href="#2">href text 2</a>
        </span>
        <span class="sub3">
            <span class="span3">
                <p>Some other data</p>
                <span class="sub4">
                    <span class="sub5">More data</span>
                </span>
            </span>
        </span>
    </span>
</p>
For each chunk, I need to grab data from various sections:
And then process the set of data as one distinct unit, and not as multiple collections of different data.
I know I can hard-code the selection of each element (and I currently do that), but that produces brittle code reliant on the source data being stable. This week the data source changed yet again, and I was bitten by my hard-coded scraping failing to work. Thus I am trying to write robust code that can locate what I want without me having to know or care about the overall structure (Hmmm - LINQ for PHP?).
So in my mind, I want the code to look something like:
$dom = new Zend_Dom_Query($body);
$results = $dom->query('.main');
foreach ($results as $result)
{
    // Hypothetical per-result query(): this is what I would like to be able to write
    $data1 = $result->query(".main a");
    $data2 = $result->query(".main .span2");
    $data3 = $result->query(".main .sub a");
    // etc.
    if ($data1 && $data2 && $data3) {
        // Do something
    } else {
        // Do something else
    }
}
Is it possible to do what I want with stock Zend/PHP function calls? Or do I need to write some sort of custom function to implement $result->query()?
OK .. so I bit the bullet and wrote my own solution to the problem. This code recurses through the results from the Zend_Dom_Query and looks for matching CSS selectors. As presented the code works for me and has also helped clean up my code. Performance wasn't an issue for me, but as always, caveat emptor. I have also left in some commented-out code that enables visualization of where the search is leading. The code was also part of a class, hence the use of $this-> in places.
The code is used as:
$dom = new Zend_Dom_Query($body);
$results = $dom->query('.main');
foreach ($results as $result)
{
    $data1 = $this->domQuery($result, ".sub2 a");
    if (!is_null($data1))
    {
        // Do something
    }
}
Which finds the <a href="#2">href text 2</a> element under the <span class="sub2"> element.
// Function that recurses through a node from a Zend_Dom_Query_Result, looking for css selectors
private function recurseDomQueryResult($dom, $depth, $targets, $index, $count)
{
    // Gross checking
    if ($index < 0 || $index >= $count) return NULL;
    // Document where we are
    $element = $dom->nodeName;
    $class = NULL;
    $id = NULL;
    // $href = NULL;
    // Skip unwanted elements
    if ($element == '#text') return NULL;
    if ($dom->hasAttributes()) {
        if ($dom->hasAttribute('class'))
        {
            $class = trim($dom->getAttribute('class'));
        }
        if ($dom->hasAttribute('id'))
        {
            $id = trim($dom->getAttribute('id'));
        }
        // if ($element == 'a')
        // {
        //     if ($dom->hasAttribute('href'))
        //     {
        //         $href = trim($dom->getAttribute('href'));
        //     }
        // }
    }
    // $padding = str_repeat('==', $depth);
    // echo "$padding<$element";
    // if (!($class === NULL)) echo ' class="'.$class.'"';
    // if (!($href === NULL)) echo ' href="'.$href.'"';
    // echo '><br />'. "\n";
    // See if we have a match for the current element
    $target = $targets[$index];
    $sliced = substr($target, 1);
    switch ($target[0])
    {
        case '.':
            // Match any one of the element's space-separated class names,
            // so that ".main" also matches class="main other"
            if ($class !== NULL && in_array($sliced, preg_split('/\s+/', $class), true)) {
                $index++;
            }
            break;
        case '#':
            if ($sliced === $id) {
                $index++;
            }
            break;
        default:
            if ($target === $element) {
                $index++;
            }
            break;
    }
    // Check for having matched all
    if ($index == $count) return $dom;
    // We have not matched the whole path at this level
    // So recursively look at all the children
    $children = $dom->childNodes;
    if ($children) {
        foreach ($children as $child)
        {
            if (!is_null(($result = $this->recurseDomQueryResult($child, $depth+1, $targets, $index, $count)))) return $result;
        }
    }
    // Did not find anything
    // echo "$padding</$element><br />\n";
    return NULL;
}
// User function that you call to find a single element under a node from a Zend_Dom_Query_Result
// $dom is a single node taken from the Zend_Dom_Query_Result (e.g. $result in the loop above)
// $path is a path of css selectors, e.g. ".sub2 a"
private function domQuery($dom, $path)
{
    $depth = 0;
    $index = 0;
    // Split on whitespace, ignoring leading, trailing and repeated spaces
    $targets = preg_split('/\s+/', trim($path), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($targets);
    return $this->recurseDomQueryResult($dom, $depth, $targets, $index, $count);
}
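For comparison, stock PHP can also scope a query to a single node: DOMXPath::query() accepts a context node as its second argument, and a relative expression starting with .// then searches only that node's descendants. The following is only a sketch under assumptions, not a drop-in replacement for the code above. It uses XPath rather than CSS selectors, it builds its own DOMDocument from invented markup instead of going through Zend_Dom_Query, and hasClass() is a hypothetical helper written for this example. Applying the same idea to each $result from Zend_Dom_Query (via $result->ownerDocument) is an assumption that would need checking.

```php
<?php
// Invented markup: two chunks, the second one lacks the .sub2 link.
$html = '<div class="main"><a href="#1">first</a>'
      . '<span class="sub2"><a href="#2">second</a></span></div>'
      . '<div class="main"><a href="#3">third</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: builds an XPath test for "has CSS class $name".
function hasClass($name)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$rows = [];
foreach ($xpath->query('//*[' . hasClass('main') . ']') as $chunk) {
    // The second argument limits the search to this chunk's descendants,
    // so each chunk is processed as one unit (roughly ".sub2 a").
    $link   = $xpath->query('.//*[' . hasClass('sub2') . ']//a', $chunk)->item(0);
    $rows[] = $link ? $link->getAttribute('href') : null;
}
```

Each entry in $rows belongs to exactly one chunk, and a chunk with no matching link yields null instead of silently shifting later results.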