PHP DOMDocument: Get inner HTML of node


When loading HTML into a <textarea>, I want to treat different kinds of links differently. Consider the following links:

  1. <a href="http://stackoverflow.com">http://stackoverflow.com</a>
  2. <a href="http://stackoverflow.com">StackOverflow</a>

When the text inside a link matches its href attribute, I want to remove the HTML; otherwise, the HTML should remain unchanged.

Here's my code:

$body = "Some HTML with a <a href=\"http://stackoverflow.com\">http://stackoverflow.com</a>";

$dom = new DOMDocument;
$dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($dom->getElementsByTagName('a') as $node) {
    $link_text = $node->ownerDocument->saveHTML($node->childNodes[0]);
    $link_href = $node->getAttribute("href");
    $link_node = $dom->createTextNode($link_href);

    $node->parentNode->replaceChild($link_node, $node);
}

$html = $dom->saveHTML();

The problem with the above code is that DOMDocument wraps my HTML in a paragraph tag:

<p>Some HTML with a http://stackoverflow.com</p>

How do I get it to return only the inner HTML of that paragraph?


Solution

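  • A valid DOM document needs a single root node.

    I suggest wrapping the input in a root <div> of your own, so that you do not destroy a root element that may already exist.

    Finally, either read the nodeValue of the root node (text only, so any remaining tags are lost) or strip the wrapper from the saveHTML() output with substr().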

    $body = "Some HTML with a <a href=\"http://stackoverflow.com\">http://stackoverflow.com</a>";
    $body = '<div>'.$body.'</div>';
    
    $dom = new DOMDocument;
    $dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    
    foreach ($dom->getElementsByTagName('a') as $node) {
        $link_text = $node->ownerDocument->saveHTML($node->childNodes[0]);
        $link_href = $node->getAttribute("href");
        $link_node = $dom->createTextNode($link_href);
    
        $node->parentNode->replaceChild($link_node, $node);
    }
    
    // Probably better than nodeValue: serialise, then strip the wrapper.
    $html = $dom->saveHTML();
    $html = substr($html, 5, -7); // drops "<div>" (5 chars) and "</div>\n" (7 chars)
    var_dump($html); // "Some HTML with a http://stackoverflow.com"
    

    The wrapper approach also works if the input string already has its own root element, for example:

    <p>Some HTML with a <a href="http://stackoverflow.com">http://stackoverflow.com</a></p>


    In that case it outputs:

    <p>Some HTML with a http://stackoverflow.com</p>
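
Two details of the code above may be worth a sketch. The loop replaces every link regardless of its text, and the substr() offsets are tied to the exact length of the wrapper tags. The following is a minimal sketch under stated assumptions, not the answerer's method: innerHTML() and unlinkBareUrls() are hypothetical helper names, and comparing the trimmed textContent against href is one assumed reading of "matches". It also copies the node list to an array first, because getElementsByTagName() returns a live list that changes while nodes are being replaced.

```php
<?php
// Hypothetical helper (not a DOMDocument method): concatenate the
// serialised children of $node, i.e. its inner HTML.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical helper: replace <a> elements whose text equals their href
// with a plain text node, and return the resulting HTML fragment.
function unlinkBareUrls(string $body): string
{
    $dom = new DOMDocument;
    // Temporary wrapper so the fragment has exactly one root element.
    $dom->loadHTML('<div>' . $body . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // The node list is live, so snapshot it before replacing nodes in the loop.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $node) {
        $href = $node->getAttribute('href');
        // Assumption: "matches" means the trimmed link text is identical to href.
        if (trim($node->textContent) === $href) {
            $node->parentNode->replaceChild($dom->createTextNode($href), $node);
        }
    }

    // Serialise only the wrapper's children, so no substr() offsets are needed.
    return innerHTML($dom->documentElement);
}

$body = 'Some HTML with a <a href="http://stackoverflow.com">http://stackoverflow.com</a>'
      . ' and a <a href="http://stackoverflow.com">StackOverflow</a> link';

$html = unlinkBareUrls($body);
```

With this input, the first link should be reduced to its URL while the second keeps its markup. Because the wrapper's children are serialised one by one, the result does not depend on the length of the wrapper tags or on the trailing newline that saveHTML() appends to a full document.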