
Get DOMXpath results below previous result in HTML


I'm trying to sort through the HTML of an external website and, unfortunately, the site is very poorly organized. The data might look something like this:

<a class="title">Title One</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>

<a class="title">Title Two</a>
<a class="item">Item One</a>
<a class="item">Item Two</a>    

And I'm working with an xpath query like this for the titles:

$titles = $x->evaluate('//a[@class="title"]');

Now, I want to list the titles with the items below them. Unfortunately, none of these elements are conveniently wrapped in a parent div, so I can't just filter through everything in the parent. So, I use a query like this for the items:

$items = $x->evaluate('//a[@class="item"]');

Ideally, what I'd like to do is ONLY check for results below the current title element. So, if I'm looping through and hit "title one", I want to only check the "item" results that appear between title one and title two. Is there any way to do this?

Modifying the HTML is not an option here. I know this question is a little ridiculous and my explanation might be horrible, but if there's a solution, it would really help me!

Thanks everyone.


Solution

  • You can find the title elements first and then use the nextSibling property (it is a property, not a method) to walk forward through the nodes that follow each title:

    $html = <<<EOM
    <a class="title">Title One</a>
    <a class="item">Item One</a>
    <a class="item">Item Two</a>
    
    <a class="title">Title Two</a>
    <a class="item">Item One</a>
    <a class="item">Item Two</a>
    EOM;
    
    $d = new DOMDocument;
    $d->loadHTML($html);
    $x = new DOMXPath($d);
    foreach ($x->query('//a[@class="title"]') as $node) {
        echo "Title: {$node->nodeValue}\n";
        // iterate the siblings
        while ($node = $node->nextSibling) {
            if ($node->nodeType != XML_ELEMENT_NODE) {
                continue; // skip non-element nodes, e.g. whitespace text between the tags
            }
            if ($node->getAttribute('class') != 'item') {
                // no more .item
                break;
            }
            echo "Item: {$node->nodeValue}\n";
        }
    }
    

    Output:

    Title: Title One
    Item: Item One
    Item: Item Two
    Title: Title Two
    Item: Item One
    Item: Item Two
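
If you would rather let XPath do the grouping, the same idea can be sketched with the following-sibling and preceding-sibling axes. This is only a sketch under assumptions: groupItemsByTitle is a hypothetical helper name, not part of DOMXPath, and it assumes each title and its items share the same parent element. For every title it counts how many title siblings precede it, then selects the later items that have exactly one more title before them, so the current title is the nearest one above each item.

```php
<?php
// Hypothetical helper (not from the original answer): groups the items
// under each title with XPath instead of walking nextSibling by hand.
// Assumes titles and their items are siblings under one parent.
function groupItemsByTitle(string $html): array
{
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $groups = [];
    foreach ($xpath->query('//a[@class="title"]') as $title) {
        // How many title siblings come before this title.
        $before = (int) $xpath->evaluate(
            'count(preceding-sibling::a[@class="title"])',
            $title
        );

        // Later item siblings that have exactly ($before + 1) titles
        // before them, i.e. this title is the closest one above them.
        $items = $xpath->query(
            'following-sibling::a[@class="item"]'
            . '[count(preceding-sibling::a[@class="title"]) = ' . ($before + 1) . ']',
            $title
        );

        $names = [];
        foreach ($items as $item) {
            $names[] = trim($item->nodeValue);
        }
        $groups[] = ['title' => trim($title->nodeValue), 'items' => $names];
    }

    return $groups;
}

// Same sample markup as in the question.
$html = implode("\n", [
    '<a class="title">Title One</a>',
    '<a class="item">Item One</a>',
    '<a class="item">Item Two</a>',
    '<a class="title">Title Two</a>',
    '<a class="item">Item One</a>',
    '<a class="item">Item Two</a>',
]);

foreach (groupItemsByTitle($html) as $group) {
    echo "Title: {$group['title']}\n";
    foreach ($group['items'] as $name) {
        echo "Item: {$name}\n";
    }
}
```

One difference from the nextSibling loop: that loop stops at the first element that is not an item, while this query collects every item sibling up to the next title, even if some other element sits in between. Which behaviour you want depends on how messy the real page is.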