
PHP Simple HTML Dom Parser find() crashing while traversing past a null element


I'm trying to chain Simple HTML DOM Parser find() calls to traverse HTML, but the chain crashes when one of the elements is absent. For example:

$obj = $page->find('#headings', 0)->find('h4', 0)->nodes[0];

will stop PHP with a fatal error if find('#headings', 0) or find('h4', 0) returns null (i.e. if the element is not in the HTML), because the next call in the chain is then made on null. It succeeds if all the elements are present.
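The failure can be reproduced without the library. The sketch below assumes PHP 7 or later, where calling a method on null throws a catchable Error (older versions stop with an uncatchable fatal error instead). StubPage is a hypothetical stand-in, not part of simple_html_dom; its find() always returns null, as the real find() does when an index is given and nothing matches.

```php
<?php
// StubPage is a hypothetical stand-in for the parser object; its find()
// always returns null to simulate a selector that matches nothing.
class StubPage
{
    public function find($selector, $idx = null)
    {
        return null;
    }
}

$page = new StubPage();
$message = null;

try {
    // The second find() is called on null, so PHP throws an Error here.
    $obj = $page->find('#headings', 0)->find('h4', 0);
} catch (Error $e) {
    $message = $e->getMessage();
}

echo $message, PHP_EOL; // Call to a member function find() on null
```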

Is there a way to make the above chain simply return null instead of crashing PHP? I've considered modifying simplehtmldom but I'm not sure how. The find() function is listed below:

// find dom node by css selector
// Paperg - allow us to specify that we want case insensitive testing of the value of the selector.
function find($selector, $idx=null, $lowercase=false)
{
    return $this->root->find($selector, $idx, $lowercase);
}

EDIT: (Solution)

Following user1508519's suggestion, I have created an alternative nfind() function. When chained, nfind() returns an empty node instead of null, so a further method call no longer crashes. PHP will still raise a notice if a null property (as opposed to a method) is referenced further down the chain, but it will not crash without explanation as it does with find().

// modified version of simple_html_dom->find() that will return an empty node instead of null when chained if an element is not found. simple_html_dom_node->nfind() must also be created for this to work.
function nfind($selector, $idx=null, $lowercase=false)
{
    return $this->root->nfind($selector, $idx, $lowercase);
}

The code that actually performs the find operation is in simple_html_dom_node->find(). The following function should be placed inside simple_html_dom_node for the whole package to work correctly. Only the last line is modified; for some reason, wrapping the original find() function and checking the result with is_null still seems to crash PHP.

// modified version of simple_html_dom_node->find()
function nfind($selector, $idx=null, $lowercase=false)
{
    $selectors = $this->parse_selector($selector);
    if (($count=count($selectors))===0) return array();
    $found_keys = array();

    // find each selector
    for ($c=0; $c<$count; ++$c)
    {
        // The change on the below line was documented on the sourceforge code tracker id 2788009
        // used to be: if (($levle=count($selectors[0]))===0) return array();
        if (($levle=count($selectors[$c]))===0) return array();
        if (!isset($this->_[HDOM_INFO_BEGIN])) return array();

        $head = array($this->_[HDOM_INFO_BEGIN]=>1);

        // handle descendant selectors, no recursive!
        for ($l=0; $l<$levle; ++$l)
        {
            $ret = array();
            foreach ($head as $k=>$v)
            {
                $n = ($k===-1) ? $this->dom->root : $this->dom->nodes[$k];
                //PaperG - Pass this optional parameter on to the seek function.
                $n->seek($selectors[$c][$l], $ret, $lowercase);
            }
            $head = $ret;
        }

        foreach ($head as $k=>$v)
        {
            if (!isset($found_keys[$k]))
                $found_keys[$k] = 1;
        }
    }

    // sort keys
    ksort($found_keys);

    $found = array();
    foreach ($found_keys as $k=>$v)
        $found[] = $this->dom->nodes[$k];

    // return nth-element or array
    if (is_null($idx)) return $found;
    else if ($idx<0) $idx = count($found) + $idx;
    return (isset($found[$idx])) ? $found[$idx] : new simple_html_dom_node('');
}
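The effect of returning an empty node instead of null can be seen in isolation with a small stand-in class. This is only a sketch of the idea: StubNode and its tag-matching nfind() are hypothetical and much simpler than simple_html_dom_node.

```php
<?php
// StubNode is a hypothetical stand-in for simple_html_dom_node, used only
// to show the "return an empty node instead of null" idea in isolation.
class StubNode
{
    public $tag;
    public $children;

    public function __construct($tag = '', array $children = array())
    {
        $this->tag = $tag;
        $this->children = $children;
    }

    // Returns the first direct child with the given tag, or an empty
    // StubNode (never null) so a chained call always has an object to use.
    public function nfind($tag)
    {
        foreach ($this->children as $child) {
            if ($child->tag === $tag) {
                return $child;
            }
        }
        return new StubNode();
    }
}

$page = new StubNode('body', array(
    new StubNode('div', array(new StubNode('h4'))),
));

$hit  = $page->nfind('div')->nfind('h4');   // both elements exist
$miss = $page->nfind('span')->nfind('h4');  // 'span' is absent, chain still runs

var_dump($hit->tag);   // string(2) "h4"
var_dump($miss->tag);  // string(0) ""
```

One side effect worth noting: the caller can no longer test the result with is_null() and has to recognise the empty node some other way, for example by its empty tag in this sketch.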

Thanks again to user1508519 for helping me come to the desired solution while providing a range of equally valid alternatives! Comments are welcome as to the validity of the solution/potential side effects or if there is a more elegant way to accomplish this should anyone have further input.


Solution

  • Why would you do it in a chain? Why not check after each call whether the result is null? Like the comment said, you cannot call a method on a null value. If you called find() without an index and looped over the returned array with foreach, you would not need the null check at all.

    $obj = $page->find('#headings', 0);
    if (!is_null($obj)) {
        // search inside the node just found, not the whole page
        $obj = $obj->find('h4', 0);
        if (!is_null($obj)) {
            // ...continue...
        }
    }
    

    EDIT:

    function find($selector, $idx=null, $lowercase=false)
    {
        $result = $this->root->find($selector, $idx, $lowercase);
        if (is_null($result))
        {
             die("error");
             // throw exception?
        }
        return $result;
    }
    

    OR

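    Write a wrapper function of your own that internally calls Simple HTML DOM's find().

    Like:

    function wrapper($node, $selector, $idx=null, $lowercase=false) {
        // return null instead of failing when there is no node to search
        return is_null($node) ? null : $node->find($selector, $idx, $lowercase);
    }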
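One way the wrapper idea could be extended to a whole chain is sketched below. The names find_chain() and StubElement are hypothetical, not part of simple_html_dom. The helper assumes every step passes an integer index, so each find() call returns either a single node or null, and the stub only looks up children by exact selector string so that the example runs on its own.

```php
<?php
// find_chain() is a hypothetical helper, not part of simple_html_dom.
// $steps is a list of array($selector, $idx) pairs applied left to right.
function find_chain($node, array $steps)
{
    foreach ($steps as $step) {
        if (!is_object($node)) {
            return null; // a previous step found nothing
        }
        list($selector, $idx) = $step;
        $node = $node->find($selector, $idx);
    }
    return is_object($node) ? $node : null;
}

// StubElement is a hypothetical stand-in for a parsed node. Its find()
// returns the child stored under the selector string, or null.
class StubElement
{
    private $children;

    public function __construct(array $children = array())
    {
        $this->children = $children;
    }

    public function find($selector, $idx = null, $lowercase = false)
    {
        return isset($this->children[$selector]) ? $this->children[$selector] : null;
    }
}

$page = new StubElement(array(
    '#headings' => new StubElement(array('h4' => new StubElement())),
));

$found   = find_chain($page, array(array('#headings', 0), array('h4', 0)));
$missing = find_chain($page, array(array('#sidebar', 0), array('h4', 0)));

var_dump($found instanceof StubElement); // bool(true)
var_dump($missing);                      // NULL
```

On PHP 8.0 or later, the nullsafe operator should give the same short-circuit behaviour without a helper: $page->find('#headings', 0)?->find('h4', 0)?->nodes[0] evaluates to null as soon as one find() returns null.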