Tags: php, web-scraping, domdocument, domxpath

Randomly getting a non-object error when traversing with PHP DOMDocument


Below is my code:

$xpath = new DOMXPath($doc);
// Start from the root element
$query = '//div[contains(@class, "hudpagepad")]/div/ul/li/a';
$nodeList = @$xpath->query($query);

// The size is 104
$size = $nodeList->length;

for ( $i = 1; $i <= $size; $i++ ) {
    $node = $nodeList->item($i-1);
    $url = $node->getAttribute("href");

    $error = scrapeURL($url);
}

function scrapeURL($url) {
    $cfm = new DOMDocument();
    $cfm->loadHTMLFile($url);
    $cfmpath = new DOMXPath($cfm);
    $pointer = $cfm->getElementById('content-area');
    $filter = 'table/tr';

    // The problem lies here    
    $state = $pointer->firstChild->nextSibling->nextSibling->nodeValue;

    $nodeList = $cfmpath->query($filter, $pointer);
}

Basically, this traverses a list of links and scrapes each one with the scrapeURL function.

I don't know what the problem is, but I randomly get a non-object error when using $pointer; at other times the script runs without any error and the values are correct.

Does anyone know what the problem is? My guess is that it occurs when the page is not loaded properly.


Solution

  • I found the idea for the answer here:

    http://sharovatov.wordpress.com/2009/11/01/php-loadhtmlfile-and-a-html-file-without-doctype/

    It is better to use a 'manual' XPath query than getElementById, because getElementById breaks if the DOCTYPE of the document you are about to load is missing or not well formed.

    So use this instead:

    $cfmpath->query("//*[@id='content-area']")

    or create a function:

    function getElementById($id) {
        global $dom;
        $xpath = new DOMXPath($dom);
        return $xpath->query("//*[@id='$id']")->item(0);
    }
    

    Thank you to those who attempted to help!
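
As a rough illustration of the XPath lookup suggested in the answer, the sketch below combines it with a null check, so that a page without the expected element is skipped rather than triggering the non-object error. The helper name findById, the inline sample markup, and the printed messages are assumptions made for this example and are not part of the original code; the sample markup stands in for a scraped page that has no DOCTYPE.

```php
<?php
// Hypothetical helper (not from the original post): look up an element by id
// through XPath, which does not rely on the document's DOCTYPE.
function findById(DOMDocument $doc, string $id): ?DOMElement
{
    $xpath = new DOMXPath($doc);
    // Assumption: $id contains no single quote, so it can be embedded as-is.
    $nodes = $xpath->query("//*[@id='" . $id . "']");
    if ($nodes === false || $nodes->length === 0) {
        return null; // nothing matched, or the expression was invalid
    }
    $node = $nodes->item(0);
    return $node instanceof DOMElement ? $node : null;
}

// Sample markup without a DOCTYPE, standing in for a scraped page.
$html = '<html><body><div id="content-area"><h1>Title</h1>'
      . '<table><tr><td>row 1</td></tr><tr><td>row 2</td></tr></table>'
      . '</div></body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$pointer = findById($doc, 'content-area');
if ($pointer === null) {
    // Skip this page instead of dereferencing a non-object.
    echo "content-area not found\n";
} else {
    // Same relative query as in the question, evaluated from $pointer.
    $xpath = new DOMXPath($doc);
    $rows = $xpath->query('table/tr', $pointer);
    echo $rows->length, " table rows under content-area\n";
}
```

Passing the document as a parameter instead of relying on a global keeps the helper usable inside scrapeURL, where each call builds its own DOMDocument. Checking the return value of loadHTMLFile in the same way would also catch pages that fail to load at all.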