Tags: php, dom, simple-html-dom

Why does this simple_html_dom selector work when used in its entirety but not when broken into smaller selectors?


I'm having a go at scraping a page with simple_html_dom. The page has a table, and each row contains several cells. I want the contents of the third cell in each row, but that cell doesn't have a class.

<tr class="thisrow">
  <td class="firstcell"><strong>1st</strong></td>
  <td class="secondcell">nothing in here</td>
  <td><strong>blah blah</strong></td>
  <td>something else</td>
</tr>

So to get started, I went straight for the third cell:

foreach($html->find('tr.thisrow td:nth-child(3)') as $thirdcell) {
    echo $thirdcell->innertext; // this works, no problem!
}

But then I realised I also needed some data from another cell in the row (td.firstcell). That cell has a class, so I thought it best to loop through the rows and then use selectors within the context of each row:

foreach($html->find('tr.thisrow') as $row) {

    // Passing 0 as the second argument returns the first match (or null) instead of an array
    $thirdcell = $row->find('td:nth-child(3)', 0);
    echo $thirdcell; // this is now empty

    $firstcell = $row->find('td.firstcell', 0);
    echo $firstcell; // this works!

}

So as you can see, my nth-child selector stops working once it is used inside the context of the row loop. What am I missing?


Solution

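  • It is a limitation of simple_html_dom. Apparently it can handle nth-child selectors, but only when the parent of the matched element is itself in the tree below the node on which you call find. Inside the row loop, the parent of each td is the row itself, so it is not below the node being searched and the selector matches nothing.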

    But it is a valid selector, as the equivalent JavaScript shows:

    for (var row of [...document.querySelectorAll('tr.thisrow')]) {
        var thirdcell = row.querySelectorAll('td:nth-child(3)');
        console.log(thirdcell[0].textContent); // this works!
    }

    <table border=1>
    <tr class="thisrow">
      <td class="firstcell"><strong>1st</strong></td>
      <td class="secondcell">nothing in here</td>
      <td><strong>blah blah</strong></td>
      <td>something else</td>
    </tr>
    </table>

    As a workaround you could use the array index on the find('td') result:

    foreach($html->find('tr.thisrow') as $row) {
        $thirdcell = $row->find('td');
        echo $thirdcell[2]; // this works
    }
    

    Alternatively, use children(), since the td elements are direct children of the tr:

    foreach($html->find('tr.thisrow') as $row) {
        $thirdcell = $row->children();
        echo $thirdcell[2]; // this works
    }
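
    Putting the pieces together for the original goal (the first cell and the third cell of every row), a minimal sketch might look like the following. It assumes simple_html_dom.php sits next to the script (adjust the path to wherever the library lives), and extractFirstAndThirdCells is a hypothetical helper name used only for illustration. It relies on passing an index to find() and children() so that each call returns a single node or null rather than an array.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

/**
 * Hypothetical helper: returns one [first cell text, third cell text]
 * pair for every tr.thisrow found in the markup.
 */
function extractFirstAndThirdCells(string $markup): array
{
    $html = str_get_html($markup);
    if (!$html) {
        return []; // markup was empty or could not be parsed
    }

    $pairs = [];
    foreach ($html->find('tr.thisrow') as $row) {
        $firstcell = $row->find('td.firstcell', 0); // first match, or null
        $thirdcell = $row->children(2);             // third child (zero-based), or null

        if ($firstcell === null || $thirdcell === null) {
            continue; // skip rows that lack either cell
        }

        $pairs[] = [trim($firstcell->plaintext), trim($thirdcell->plaintext)];
    }

    return $pairs;
}

$markup = '<table>
<tr class="thisrow">
  <td class="firstcell"><strong>1st</strong></td>
  <td class="secondcell">nothing in here</td>
  <td><strong>blah blah</strong></td>
  <td>something else</td>
</tr>
</table>';

foreach (extractFirstAndThirdCells($markup) as [$first, $third]) {
    echo $first, ' => ', $third, PHP_EOL;
}
```

    The class selector is kept for the first cell because it has a class, while the third cell is picked by position among the row's children, which sidesteps the nth-child limitation. Rows with fewer than three cells are skipped rather than raising an error.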