PHP with DOMXPath - How to select and count from this html tree


I need to count how many of these items are open. There are four types of item: Easy, Medium, Difficult and Not-wanted. The type is the text inside each row's div. I need to exclude the 'Not-wanted' items from the count. Note that the 'Open' and 'Closed' values have varying amounts of whitespace around them. This is the HTML structure:

<table>
    <tbody>
        <tr>
            <td>
                <div>Difficult</div>
            </td>
            <td>Name</td>
            <td>  Open </td>
        </tr>
        <tr>
            <td>
                <div>Easy</div>
            </td>
            <td>Name</td>
            <td> Closed  </td>
        </tr>
        <tr>
            <td>
                <div>Easy</div>
            </td>
            <td>Name</td>
            <td>   Open   </td>
        </tr>
        <tr>
            <td>
                <div>Medium</div>
            </td>
            <td>Name</td>
            <td>Open </td>
        </tr>
        <tr>
            <td>
                <div>Easy</div>
            </td>
            <td>Name</td>
            <td> Open     </td>
        </tr>
        <tr>
            <td>
                <div>Medium</div>
            </td>
            <td>Name</td>
            <td>  Closed</td>
        </tr>
        <tr>
            <td>
                <div>Easy</div>
            </td>
            <td>Name</td>
            <td>Closed </td>
        </tr>
        <tr>
            <td>
                <div>Not-wanted</div>
            </td>
            <td>Name</td>
            <td> Open </td>
        </tr>
        <tr>
            <td>
                <div>Difficult</div>
            </td>
            <td>Name</td>
            <td>Open</td>
        </tr>
        ............

This is one of my attempts to solve the problem. It is obviously wrong, but I don't know how to get it right.

$doc = new DOMDocument();
$doc->loadHtmlFile('http://www.nameofsite.com');
$doc->preserveWhiteSpace = false;
$xpath = new DOMXPath($doc);

$elements = $xpath->query("/html/body/div[1]/div/section/div/section/article/div/div[1]/div/div/div[2]/div[1]/div[2]/div/section/div/div/table/tbody/tr");

$count = 0;
foreach ($elements as $element) {
    if ($element->childNodes->nodeValue != 'Not-wanted') {
        if ($element->childNodes->nodeValue === 'open') {
            $count++;
        }
    }
}

echo $count;

I have only a rudimentary knowledge of DOMXPath and can only write simple queries, so this is too complex for me.

Can anybody help?

Thanks in advance.


Solution

  • Based on the data in your example, you can change the XPath expression to the following, which selects only the <tr> elements that match your conditions:

    //table/tbody/tr[normalize-space(td[3]/text()) = 'Open' and td[1]/div/text() != 'Not-wanted']

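    normalize-space() strips the whitespace around the text of the third cell, so every variant of ' Open ' compares equal to 'Open'; the second condition drops rows whose first cell contains 'Not-wanted'. $elements is then a DOMNodeList, and its length property gives the number of matching rows.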

    For example:

    $source = <<<SOURCE
    <table>
        <tbody>
            <tr>
                <td>
                    <div>Difficult</div>
                </td>
                <td>Name</td>
                <td>  Open </td>
            </tr>
            <tr>
                <td>
                    <div>Easy</div>
                </td>
                <td>Name</td>
                <td> Closed  </td>
            </tr>
            <tr>
                <td>
                    <div>Easy</div>
                </td>
                <td>Name</td>
                <td>   Open   </td>
            </tr>
            <tr>
                <td>
                    <div>Medium</div>
                </td>
                <td>Name</td>
                <td>Open </td>
            </tr>
            <tr>
                <td>
                    <div>Easy</div>
                </td>
                <td>Name</td>
                <td> Open     </td>
            </tr>
            <tr>
                <td>
                    <div>Medium</div>
                </td>
                <td>Name</td>
                <td>  Closed</td>
            </tr>
            <tr>
                <td>
                    <div>Easy</div>
                </td>
                <td>Name</td>
                <td>Closed </td>
            </tr>
            <tr>
                <td>
                    <div>Not-wanted</div>
                </td>
                <td>Name</td>
                <td> Open </td>
            </tr>
            <tr>
                <td>
                    <div>Difficult</div>
                </td>
                <td>Name</td>
                <td>Open</td>
            </tr>
        </tbody>
    </table>
    SOURCE;
    
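    $doc = new DOMDocument();
    // Parser options such as preserveWhiteSpace only take effect when set before loading;
    // the query itself relies on normalize-space() to trim the cell text.
    $doc->preserveWhiteSpace = false;
    $doc->loadHTML($source);
    $xpath = new DOMXPath($doc);
    // Select rows whose third cell is 'Open' and whose first cell is not 'Not-wanted'.
    $elements = $xpath->query("//table/tbody/tr[normalize-space(td[3]/text()) = 'Open' and td[1]/div/text() != 'Not-wanted']");
    // DOMNodeList::length is the number of matching <tr> elements.
    echo $elements->length;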
    

    Which will result in:

    5
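
If only the number is needed, the counting can probably be pushed into XPath itself. The following is a minimal sketch, not part of the original answer: it assumes a small hypothetical three-row sample ($html) in place of the full table and uses DOMXPath::evaluate() with the XPath count() function, which returns a number rather than a node list.

```php
<?php
// Hypothetical three-row sample standing in for the full table above.
$html = '<table><tbody>'
      . '<tr><td><div>Easy</div></td><td>Name</td><td>  Open </td></tr>'
      . '<tr><td><div>Not-wanted</div></td><td>Name</td><td> Open </td></tr>'
      . '<tr><td><div>Medium</div></td><td>Name</td><td>Closed </td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// count() is computed by the XPath engine; evaluate() returns it as a float,
// so cast it to int before using it as a counter.
$openCount = (int) $xpath->evaluate(
    "count(//table/tbody/tr[normalize-space(td[3]) = 'Open'"
    . " and normalize-space(td[1]/div) != 'Not-wanted'])"
);

echo $openCount, PHP_EOL; // prints 1 for the sample above
```

Only the first sample row is both open and not 'Not-wanted', so the sketch prints 1.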

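The loop in the question fails for two reasons: $element->childNodes is a DOMNodeList, which has no nodeValue property, and the comparison against 'open' ignores both the capital 'O' and the surrounding whitespace. If you would rather keep a loop, a sketch along these lines may work; the sample rows in $html are hypothetical, and each expression is evaluated relative to the current row by passing it as the context node.

```php
<?php
// Hypothetical sample rows; the real page would be loaded as in the question.
$html = '<table><tbody>'
      . '<tr><td><div>Difficult</div></td><td>Name</td><td>  Open </td></tr>'
      . '<tr><td><div>Easy</div></td><td>Name</td><td> Closed  </td></tr>'
      . '<tr><td><div>Not-wanted</div></td><td>Name</td><td> Open </td></tr>'
      . '<tr><td><div>Medium</div></td><td>Name</td><td>Open </td></tr>'
      . '</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$count = 0;
foreach ($xpath->query('//table/tbody/tr') as $row) {
    // The second argument makes the expression relative to this <tr>.
    $type   = $xpath->evaluate('normalize-space(td[1]/div)', $row);
    $status = $xpath->evaluate('normalize-space(td[3])', $row);

    // Count open rows, skipping the 'Not-wanted' type.
    if ($type !== 'Not-wanted' && $status === 'Open') {
        $count++;
    }
}

echo $count, PHP_EOL; // prints 2 for the sample above
```

The single-expression query is shorter, but the loop is easier to extend if you later need per-type counts or want to inspect each row.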