Tags: php, dom, xpath, screen-scraping

Siblings with dom/xpath


I have been trying for several days to parse the following HTML (notice that there is no real hierarchical tree structure). Everything is pretty much on the same level.

<p><span class='one'>week number</span></p>

<p><span class='two'>day of the week</span></p>
<table class='spreadsheet'>
table data
</table>

<p><span class='two'>another day of the week</span></p>
<table class='spreadsheet'>
table data
</table>

<p><span class='one'>another week number</span></p>
ETC

What I basically want to do is go through each DOM element and check whether it is a week. If it is, I want to add all the days of that week to it, and add the table data to the corresponding day of the week. So something with the following structure:

array {
31 => array {
    monday => array {
        data => table data
    }
    tuesday => array {
        data => table data
    }   
}

32 => array {
    monday => array {
        data => table data
    }
    tuesday => array {
        data => table data
    }   
}
}

This is the PHP code I have so far.

$d = new DomDocument;
@$d->loadHtml($html);
$xp = new DomXpath($d);

$res = $xp->query( "//*[@class='one' or @class='two' or @class='spreadsheet']" ); 

foreach ($res as $dn) {
    $nodes = $dn->childNodes;
    foreach ($nodes as $node) {
        if ($node->nodeValue != "") {
            echo $node->nodeValue;
        }

    }
}

Some people here at Stack Overflow have advised me to use XPath to achieve this. The above code handles each node separately. What I think I need to do is get all the "week" nodes, then get each one's next sibling and check whether it is a day. If so, add it to that week's array; if it is another "week" node, create a new array, and so on.

I have been tearing my hair out the past few days with this, so any help/push in the right direction would be very much appreciated!!!

Cheers, Dandoen


Solution

  • Updated; see below.

    It would help if you would tell us what the output is of the code you've tried so far. That would help us know what already works and what's still broken. However, here's what I see looking at your use of XPath and DOM. (Disclaimer: my expertise is in XPath and DOM, not PHP.)

    $res = $xp->query( "//*[@class='one' or @class='two' or @class='spreadsheet']" ); 
    

    This XPath query will give you all the <span> and <table> nodes in your sample, because those are the elements that have the classes you asked for.

    foreach ($res as $dn) {
    

    Iterating over the span and table elements. Inside this loop is where you probably want to say if ($dn->getAttribute("class") == "one") ... and if so start a new week in your array structure; if the class is "two", add a new week day to your current week, etc.

    $nodes = $dn->childNodes;
    

    Here you're asking for the child nodes of the current span or table element. For the span, the only child node you've shown is a text node such as "another day of the week". For the table element, we assume there are tr elements etc.

    foreach ($nodes as $node) {
    

    Iterating over the single text node in a span (or child elements of a table):

        if ($node->nodeValue != "") {
            echo $node->nodeValue;
        }
    

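    Print the text content of a text node (child of a span element). For an element node (like the tr child of a table), the DOM specification gives nodeValue as null, but PHP's DOM extension returns the element's text content instead, so the text inside the table gets echoed as well.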

    So that's what the above code seems to be doing. If it's not behaving as described, post info about the actual output and we may be able to help. If it's behaving as described but you need help with the part about creating week array elements, let us know that.
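
    The idea of branching on the class attribute inside the loop can be sketched as a single pass over the original query. This is only a sketch under assumptions: the sample markup and cell values are made up, the names $result, $week and $day are illustrative, and each table is stored as its plain text content rather than parsed into rows.

```php
<?php
libxml_use_internal_errors(true); // collect parser warnings instead of printing them

// Hypothetical sample input modelled on the question's markup.
$html = <<<'HTML'
<p><span class='one'>31</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>a</td></tr></table>
<p><span class='two'>tuesday</span></p>
<table class='spreadsheet'><tr><td>b</td></tr></table>
<p><span class='one'>32</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>c</td></tr></table>
HTML;

$d = new DOMDocument();
$d->loadHTML($html);
$xp = new DOMXPath($d);

$result = [];
$week = null; // week currently being filled
$day = null;  // day currently being filled

// The query returns the matching elements in document order,
// so remembering the "current" week and day is enough.
foreach ($xp->query("//*[@class='one' or @class='two' or @class='spreadsheet']") as $dn) {
    $class = $dn->getAttribute('class');
    if ($class === 'one') {
        $week = trim($dn->textContent);
        $result[$week] = [];
        $day = null;
    } elseif ($class === 'two' && $week !== null) {
        $day = trim($dn->textContent);
        $result[$week][$day] = ['data' => null];
    } elseif ($class === 'spreadsheet' && $day !== null) {
        $result[$week][$day]['data'] = trim($dn->textContent);
    }
}

print_r($result);
```

    Because the nodes arrive in document order, no sibling navigation is needed here; the trade-off is that a table with no preceding day span is silently skipped.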

    Update:

    I would suggest that you use this XPath query:

    $weeks = $xp->query( "//*[@class='one']" ); 
    

    to get the week number nodes. Then iterate over them:

    foreach ($weeks as $week) {
        $weekNum = $week->firstChild->nodeValue;
    

    This gets the week number out of the first child (a text node) of the week span.

    Create an array entry for the new week. Then select the potential week day nodes for that week:

    $spans = $xp->query( "following::span[@class='one' or @class='two']", $week );
    

    The second argument to $xp->query() is the context node, from which the following:: axis begins.
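
    To make the effect of the context node concrete, here is a small sketch on made-up markup (the variable names are illustrative). A relative expression such as following::span is evaluated from the context node, whereas an expression starting with // is evaluated from the document root even when a context node is passed.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical two-week sample without tables.
$html = "<p><span class='one'>31</span></p>"
      . "<p><span class='two'>monday</span></p>"
      . "<p><span class='one'>32</span></p>"
      . "<p><span class='two'>tuesday</span></p>";

$d = new DOMDocument();
$d->loadHTML($html);
$xp = new DOMXPath($d);

// Use the second week span as the context node.
$secondWeek = $xp->query("//span[@class='one']")->item(1);

// Relative path: only spans that come after the context node.
$after = [];
foreach ($xp->query("following::span", $secondWeek) as $span) {
    $after[] = $span->textContent;
}

// Absolute path: the context node does not restrict the result.
$allSpans = $xp->query("//span", $secondWeek)->length;

print_r($after);
echo $allSpans, "\n";
```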

    Iterate over those:

    foreach ($spans as $span) {
    

    When you get to another week, stop:

        if ($span->getAttribute("class") == "one") break;
    

    Otherwise double-check that it's a weekday:

        if ($span->getAttribute("class") == "two") {
    

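    then add the new weekday to your array. To get the table data (fixed a mistake), note that query() returns a node list, so take its first item to get the table element itself:

            $table = $xp->query("following-sibling::table[1]", $span->parentNode)->item(0);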
    

    Update: To get at the table data, you'll want to set up more loops like the above. Something like:

        $rows = $xp->query("tr", $table);
    

    to get the table rows. Then iterate through those with foreach, and within those,

        $cells = $xp->query("td", $row);
    

    And when you iterate through cells, your data will be

        $cell->firstChild->nodeValue
    

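    i.e. the text of the child text node. Note this looks only at the first child, so it will miss content if the <td> cells contain elements mixed with text; $cell->textContent returns all of the text inside the cell.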
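
    A short sketch of that caveat, using a made-up cell that mixes an element with text. It compares firstChild->nodeValue with textContent on the same cell.

```php
<?php
libxml_use_internal_errors(true);

$d = new DOMDocument();
// Hypothetical row: the first cell contains a <b> element followed by text.
$d->loadHTML("<table><tr><td><b>bold</b> tail</td><td>plain</td></tr></table>");
$xp = new DOMXPath($d);

$cell = $xp->query("//td")->item(0);

// The first child is the <b> element, so only its text is returned.
$viaFirstChild = $cell->firstChild->nodeValue;

// textContent concatenates every text node inside the cell.
$viaTextContent = $cell->textContent;

var_dump($viaFirstChild, $viaTextContent);
```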

    If you need help with creating and populating arrays in PHP, I'm not the person to advise you on that as I'm not a PHP developer.

    Note this is all untested. HTH.
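
    For reference, the steps from the update can be assembled into one sketch. Assumptions: the sample markup and cell values are made up, $result, $day and $cells are illustrative names, and the row and cell queries use .//tr and .//td rather than tr and td so that they still match if the parsed table contains a tbody. Cells are read with textContent instead of firstChild->nodeValue.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical sample input modelled on the question's markup.
$html = <<<'HTML'
<p><span class='one'>31</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>a</td><td>b</td></tr></table>
<p><span class='two'>tuesday</span></p>
<table class='spreadsheet'><tr><td>c</td><td>d</td></tr></table>
<p><span class='one'>32</span></p>
<p><span class='two'>monday</span></p>
<table class='spreadsheet'><tr><td>e</td><td>f</td></tr></table>
HTML;

$d = new DOMDocument();
$d->loadHTML($html);
$xp = new DOMXPath($d);

$result = [];
foreach ($xp->query("//span[@class='one']") as $week) {
    $weekNum = trim($week->textContent);
    $result[$weekNum] = [];

    // Week and day spans that follow this week, in document order.
    foreach ($xp->query("following::span[@class='one' or @class='two']", $week) as $span) {
        if ($span->getAttribute('class') === 'one') {
            break; // reached the next week
        }
        $day = trim($span->textContent);

        // The table is a sibling of the <p> that wraps the day span.
        $table = $xp->query("following-sibling::table[1]", $span->parentNode)->item(0);

        $rows = [];
        if ($table !== null) {
            foreach ($xp->query(".//tr", $table) as $row) {
                $cells = [];
                foreach ($xp->query(".//td", $row) as $cell) {
                    $cells[] = trim($cell->textContent);
                }
                $rows[] = $cells;
            }
        }
        $result[$weekNum][$day] = ['data' => $rows];
    }
}

print_r($result);
```

    One limitation to keep in mind: following-sibling::table[1] picks the next table after the day's paragraph, so a day that has no table of its own would pick up the table belonging to a later day.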