PHP DOMDocument works incorrectly when the node is wrapped in a figure?


I'm trying to add some HTML to all links that contain an image.

The basic HTML loaded into the DOM looks like this:

<div class='content'>
    <a href="..."><img src=""></a>

    <figure>
       <a href="..."><img src=""></a>
       <figcaption>Caption</figcaption>
    </figure>
</div>

The code:

$content = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$dom = new DOMDocument();
@$dom->loadHTML($content);

// Convert Images
$images = [];

foreach ($dom->getElementsByTagName('img') as $node) {
    $images[] = $node;
}

foreach ($images as $node) {
     $field_html = $dom->createDocumentFragment(); // create an empty fragment
     $field_html->appendXML('<span>11</span>'); // fill it with the markup to add
     $node->parentNode->appendChild($field_html); // append after the img, inside its parent
}

$newHtml = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));
return $newHtml; 

For a regular link containing an img, this produces the correct output:

<a href="..."><img src=""><span>11</span></a>

But when the link is inside a figure, the output is very strange. The link is duplicated and inserted into the figcaption:

<figure>
    <a href="..."><img src=""></a>
    <figcaption>Caption <a href="..."><span>11</span>
    </figcaption>
</figure>

Is that because DOMDocument doesn't understand the figure element?


Solution

  • I was unable to reproduce your problem. My guess would be a misplaced element somewhere in your source HTML. But your code can be simplified quite a bit.

    There's no need to put your image nodes into an array; you can work directly with the results of DOMDocument::getElementsByTagName().
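One caveat, which is only a guess at why the original code copied the nodes first: the list returned by getElementsByTagName() is live, so a loop that removes or replaces the img elements themselves is safer when it iterates over a snapshot. A minimal sketch of that case, with illustrative markup that is not from the question:

```php
<?php
// Illustrative markup (assumption): two links, each wrapping an image.
$html = '<div><a href="..."><img src=""></a><a href="..."><img src=""></a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// iterator_to_array() copies the current members of the live list,
// so modifying the document below cannot disturb the iteration.
$images = iterator_to_array($dom->getElementsByTagName('img'));

foreach ($images as $img) {
    // Swap each img for a span; this changes the live list's membership.
    $span = $dom->createElement('span', '11');
    $img->parentNode->replaceChild($span, $img);
}

echo $dom->saveHTML();
```

When the loop only appends sibling elements, as in the question, the set of img elements never changes and the snapshot is not needed.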

    As mentioned in the comments, you can pass the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options to DOMDocument::loadHTML() so that it does not add the doctype and implied elements, instead of removing them later with potentially tricky string manipulations.

    A simple DOMDocument::createElement() can be used for the element you want to append, instead of building a document fragment.

    Finally, the error control operator @ should generally be avoided. Instead, libxml_use_internal_errors() can be used to set the error behaviour. This allows you to examine error messages with libxml_get_errors() if desired.

    $content = <<< HTML
    <div class="content">
        <a href="..."><img src=""></a>
        <figure>
           <a href="..."><img src=""></a>
           <figcaption>Caption</figcaption>
        </figure>
    </div>
    HTML;
    
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_use_internal_errors(false);
    
    foreach ($dom->getElementsByTagName('img') as $node) {
         $node->parentNode->appendChild($dom->createElement("span", "11"));
    }
    
    $newHtml = $dom->saveHTML();
    echo $newHtml;
    

    Output:

    <div class="content">
        <a href="..."><img src=""><span>11</span></a>
        <figure>
           <a href="..."><img src=""><span>11</span></a>
           <figcaption>Caption</figcaption>
        </figure>
    </div>
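
To see what the parser complained about, the collected errors can be read before internal error handling is switched back. This is a minimal sketch rather than part of the code above; the input string and variable names are assumptions made for illustration:

```php
<?php
// Illustrative input (assumption): markup similar to the question's.
$content = '<div class="content"><figure><a href="..."><img src=""></a>'
         . '<figcaption>Caption</figcaption></figure></div>';

$dom = new DOMDocument();

// Collect parse errors internally instead of emitting PHP warnings;
// the function returns the previous setting so it can be restored.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the error buffer
libxml_use_internal_errors($previous);  // restore the previous behaviour

// Whether any errors are reported for figure/figcaption depends on the
// libxml2 version in use, so the loop may print nothing.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Running this against the real source HTML is one way to look for the misplaced element suspected above.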