
DOMXPath/DOMDocument - How to get the HTML of DOM elements, not only their plain text


Here is my code:

$url = "https://www.leaseweb.com/dedicated-servers/single-processor";

libxml_use_internal_errors(true); 
$doc = new DOMDocument();

$doc->loadHTMLFile($url);

$xpath = new DOMXpath($doc);

$n = $xpath->query('//td[@data-column-name="Model"]');
$r = $xpath->query('//td[@data-column-name="RAM"]');
$l = $xpath->query('//td[@data-column-name="Location"]');
$item = 0;
$i = 0;
foreach ($n as $entry) {
    $Name = $entry->nodeValue;
    $RAM  = $r->item($item)->nodeValue;
    $Location  = $l->item($item)->nodeValue;
    $i++;
    ?>
     <tr> <td><?PHP echo $i;?></td> <td><?PHP echo $Name;?></td> <td> <?PHP echo $RAM;?> </td> <td class="hidden-xs"><?PHP echo $Location;?> </td> <td><span class="label label-success">Configure</span></td> </tr>
    <?PHP
    $item++;
}

This code gives me only the text of each cell. For example, the selected td element with data-column-name="Location" contains <span id="inside_element">Holded text</span>, but instead of getting the content with the span, I receive only the plain text: Holded text.

How can I also fetch the HTML elements inside a specific DOM element?

Thanks in advance!


Solution

  • To get the markup of a specific node, including its child elements, you can call DOMNode::C14N(). This method canonicalizes the node and its descendants into a string, whereas nodeValue returns only the text content. Take a look at this example:

    <?php 
    $html = '<html>
    <head>  
    </head>
    <body>
        <div class="container">
            <div>
                <span>text span</span>
            </div>
        </div>
    </body>
    </html>';
    
    // loadHTML() is an instance method, so create the document first
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//div[@class="container"]/div');
    
    
    print $nodes->item(0)->C14N();
    

    Since I want the HTML content of div.container > div, the output will be:

    <div>
        <span>text span</span>
    </div>
    

    Alternative method

    There is another way to achieve a similar result: saving the HTML of a specific node with DOMDocument::saveHTML(), like this:

    $node = $nodes->item(0);
    
    print $node->ownerDocument->saveHTML($node); // similar to: $node->C14N();
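
The two calls do not always return identical strings: C14N() produces canonical XML, while saveHTML() produces HTML serialization. A minimal sketch of the difference on a void element, assuming a small stand-in fragment (the markup and variable names here are illustrative, not taken from the real page):

```php
<?php
// Illustrative fragment containing a void element (<br>).
$dom = new DOMDocument();
$dom->loadHTML('<p>a<br>b</p>');

$xpath = new DOMXPath($dom);
$node  = $xpath->query('//p')->item(0);

// Canonical XML: empty elements are written as start/end tag pairs.
$canonical = $node->C14N();

// HTML serialization: void elements keep their HTML form.
$serialized = $node->ownerDocument->saveHTML($node);

echo $canonical, PHP_EOL;   // <p>a<br></br>b</p>
echo $serialized, PHP_EOL;  // <p>a<br>b</p>
```

If the fragment is going to be echoed back into an HTML page, saveHTML($node) is usually the safer choice of the two.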
    

    So in your specific case, it would be something like this:

    <?php 
    $url = "https://www.leaseweb.com/dedicated-servers/single-processor";
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url);
    $xpath = new DOMXPath($doc);
    $l = $xpath->query('//td[@data-column-name="Location"]/div');
    
    var_dump($l->item(0)->C14N()); 
    # Or $l->item(0)->ownerDocument->saveHTML($l->item(0));
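
Both calls above return the selected element itself together with its children. If only the content inside the td is wanted (the span, without the surrounding td tag), one option is to serialize each child node. The sketch below assumes a hypothetical helper named innerHtml() and a small stand-in table, since the structure of the real page is not shown in the question:

```php
<?php
// Hypothetical helper: serialize only the children of a node ("inner HTML").
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Stand-in markup; the real page structure is an assumption.
$html = '<table><tr>'
      . '<td data-column-name="Model">Model A</td>'
      . '<td data-column-name="Location"><span id="inside_element">Holded text</span></td>'
      . '</tr></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$cells    = $xpath->query('//td[@data-column-name="Location"]');
$location = innerHtml($cells->item(0));

echo $location, PHP_EOL; // <span id="inside_element">Holded text</span>
```

In the loop from the question, this would mean replacing `$l->item($item)->nodeValue` with `innerHtml($l->item($item))`.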