php dom domdocument domxpath

loadHTML returning empty, html is fine


I'm trying to grab the href value of an element using PHP, but I'm having some trouble. Here's a snippet of my code.

  <?php
  ini_set("log_errors", 1);
  ini_set("error_log", "php-error.log");
  $target_url = "http://foo.bar";
  $request = $target_url;
  $html = $this->scraper($request);
  $dom = new DOMDocument();
  $dom->loadHTML($html);
  // Error point - $dom is empty
  error_log("dom:");
  error_log($dom);
  $xpath = new DOMXPath($dom);
  error_log("setting target url");
  $target_url = $xpath->query("//*[@class='foo_bar']/href");
  ?>

Logging $html results in the standard, full HTML output of the page. A search shows that my xpath should work. However, when I try to log $dom after loadHTML, I get a blank result. I've been struggling for a few hours trying to work out why, but with no luck.

Does anyone have any ideas/anything I could try?

Edited to add console output:

    [30-Sep-2015 13:51:59 America/New_York] dom:
    [30-Sep-2015 13:51:59 America/New_York] setting target url

Solution

  • You should check that the HTML was loaded into the DOM. You can use a debugger, logging, or var_dump() for that.

    var_dump($dom->saveXml());

    If it wasn't loaded into the DOM, take a step back and validate that the HTML was fetched by the scraper.

    var_dump($html);

    If the HTML was loaded into the DOM, you will still need to fix the XPath expression. I would expect href to be an attribute node, which is selected with @:

    //*[@class='foo_bar']/@href

    You seem to want to read it as a string value, so cast it:

    string(//*[@class='foo_bar']/@href)

    That only works with DOMXPath::evaluate(); DOMXPath::query() can only return node lists.

    $target_url = $xpath->evaluate("string(//*[@class='foo_bar']/@href)");
    

    A small example:

    $document = new DOMDocument();
    $document->loadHtml('<a href="http://example.com">Example</a>');
    $xpath = new DOMXpath($document);
    var_dump($xpath->evaluate('string(//a[1]/@href)'));
    

    Output:

    string(18) "http://example.com"
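
The checks above can be combined into one helper. This is only a minimal sketch, not the asker's actual scraper: the function name firstHrefByClass() and its parameters are made up for illustration. It assumes the HTML has already been fetched into a string. It collects libxml parse warnings, logs the serialized document rather than the DOMDocument object itself (an object without string conversion cannot be passed to error_log() directly), and reads the attribute with evaluate().

```php
<?php
// Hypothetical helper (not from the original post): returns the href of the
// first element whose class attribute equals $class, or null if the HTML
// could not be parsed or nothing matched.
// Assumption: $class contains no single quote, since it is placed inside a
// quoted XPath string literal.
function firstHrefByClass(string $html, string $class): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse; loadHTML() rejects an empty string
    }

    // Collect libxml parse problems instead of emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    foreach (libxml_get_errors() as $error) {
        error_log('libxml: ' . trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        return null;
    }

    // Serialize the document before logging it.
    error_log('dom: ' . $dom->saveHTML());

    $xpath = new DOMXPath($dom);
    // string() yields '' when no node matches.
    $href = $xpath->evaluate("string(//*[@class='" . $class . "']/@href)");

    return $href === '' ? null : $href;
}

var_dump(firstHrefByClass('<a class="foo_bar" href="http://example.com">Example</a>', 'foo_bar'));
// string(18) "http://example.com"
```

Returning null for both "could not parse" and "no match" keeps the sketch short. If you need to tell the two cases apart, check the log or return the libxml errors alongside the result.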