I'm trying to sort through the HTML of an external website and, unfortunately, the site is very poorly organized. The data might look something like this:
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
And I'm working with an XPath query like this for the titles:
$titles = $x->evaluate('//a[@class="title"]');
Now, I want to list the titles with the items below them. Unfortunately, none of these elements are conveniently wrapped in a parent div, so I can't just filter through everything in the parent. So, I use a query like this for the items:
$items = $x->evaluate('//a[@class="item"]');
Ideally, what I'd like to do is ONLY check for results below the current title element. So, if I'm looping through and hit "title one", I want to only check the "item" results that appear between title one and title two. Is there any way to do this?
Modifying the HTML is not an option here. I know this question is a little ridiculous and my explanation might be horrible, but if there's a solution, it would really help me!
Thanks everyone.
You can find the title elements first and then use the ->nextSibling property (it is a property, not a method) to move forward through the nodes that follow each title:
$html =<<<EOM
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
EOM;
$d = new DOMDocument;
$d->loadHTML($html);
$x = new DOMXPath($d);
foreach ($x->query('//a[@class="title"]') as $node) {
    echo "Title: {$node->nodeValue}\n";
    // iterate the siblings
    while ($node = $node->nextSibling) {
        if ($node->nodeType != XML_ELEMENT_NODE) {
            continue; // skip text nodes
        }
        if ($node->getAttribute('class') != 'item') {
            // no more .item
            break;
        }
        echo "Item: {$node->nodeValue}\n";
    }
}
Output:
Title: Title One
Item: Item One
Item: Item Two
Title: Title Two
Item: Item One
Item: Item Two
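If you would rather keep the grouping inside XPath, as the question asks, here is a rough sketch of one way to do it. It relies on a counting trick: an item belongs to the current title when the number of title siblings before it equals the position of that title among its title siblings. The helper name groupItemsByTitle() and the array layout it returns are my own inventions for illustration, and it assumes all titles and items share one parent, as in the sample markup.

```php
<?php
// Hypothetical helper (not part of any library): returns one entry per title,
// each with the text of the items that sit between it and the next title.
function groupItemsByTitle(string $html): array
{
    $d = new DOMDocument;
    $d->loadHTML($html);
    $x = new DOMXPath($d);

    $groups = [];
    foreach ($x->query('//a[@class="title"]') as $title) {
        // Position of this title among its title siblings (1-based).
        $n = (int) $x->evaluate('count(preceding-sibling::a[@class="title"])', $title) + 1;

        // Following items that have exactly $n title siblings before them,
        // which means they appear before the next title.
        $items = $x->query(
            'following-sibling::a[@class="item"]'
            . '[count(preceding-sibling::a[@class="title"]) = ' . $n . ']',
            $title
        );

        $texts = [];
        foreach ($items as $item) {
            $texts[] = $item->nodeValue;
        }
        $groups[] = ['title' => $title->nodeValue, 'items' => $texts];
    }
    return $groups;
}

$html = <<<EOM
<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>
EOM;

foreach (groupItemsByTitle($html) as $group) {
    echo "Title: {$group['title']}\n";
    foreach ($group['items'] as $text) {
        echo "Item: {$text}\n";
    }
}
```

One difference from the nextSibling loop: that loop stops at the first element that is not an item, while this query collects every item up to the next title even if unrelated elements are mixed in between. Which behaviour you want depends on how messy the real page is.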