Replace spans in PHP but keep the content inside


I have the following string:

<span style="font-size: 13px;">
   <span style="">
      <span style="">
         <span style="font-family: Roboto, sans-serif;">
            <span style="">
               Some text content
            </span>
         </span>
      </span>
   </span>
</span>

and I want to change this string to the following using PHP:

<span style="font-size: 13px;">
   <span style="font-family: Roboto, sans-serif;">
      Some text content
   </span>
</span>

I have no idea how to do that: when I try to use str_replace to remove the <span style="">, I don't know how to remove the matching </span> while keeping the content inside. Another problem is that I don't know exactly how many <span style=""> elements are in my string, and the string contains more than one of these blocks.

Thanks in advance for your help, and sorry if this is a stupid question - I'm still learning.


Solution

  • This is easily done with a proper HTML parser. PHP has DOMDocument, which parses X/HTML into a Document Object Model (DOM) tree that you can then manipulate however you want.

    The trick to solving this problem is to traverse the DOM tree recursively, visit each node, and replace the ones you don't want. To do this, I've written a short helper method by extending DOMDocument:

    $html = <<<'HTML'
    <span style="font-size: 13px;">
       <span style="">
          <span style="">
             <span style="font-family: Roboto, sans-serif;">
                <span style="">
                   Some text content
                </span>
             </span>
          </span>
       </span>
    </span>
    HTML;
    
    class MyDOMDocument extends DOMDocument {
        public function walk(DOMNode $node, $skipParent = false) {
            if (!$skipParent) {
                yield $node;
            }
            if ($node->hasChildNodes()) {
                foreach ($node->childNodes as $n) {
                    yield from $this->walk($n);
                }
            }
        }
    }
    
    libxml_use_internal_errors(true);
    
    $dom = new MyDOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    
    $keep = $remove = [];
    
    foreach ($dom->walk($dom->childNodes->item(0)) as $node) {
        if ($node->nodeName !== "span") { // we only care about span nodes
            continue;
        }
        // we'll get rid of all span nodes that have no style attribute or an empty one
        if (!$node->hasAttribute("style") || !strlen($node->getAttribute("style"))) {
            $remove[] = $node;
            foreach($node->childNodes as $child) {
                $keep[] = [$child, $node];
            }
        }
    }
    
    // move the collected children one by one, in document order (outermost spans first),
    // so inner nodes are hoisted into the right parent before the empty spans are removed
    foreach($keep as [$a, $b]) {
        $b->parentNode->insertBefore($a, $b);
    }
    foreach($remove as $a) {
        if ($a->parentNode) {
            $a->parentNode->removeChild($a);
        }
    }
    
    // Now we should have a rebuilt DOM tree with what we expect:
    echo $dom->saveHTML();
    

    Output:

    <span style="font-size: 13px;">
    
    
             <span style="font-family: Roboto, sans-serif;">
    
                   Some text content
    
             </span>
    
    
    </span>
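
The question mentions that the string holds more than one of these blocks. As a rough sketch of the same unwrap-and-remove idea for that case, the variant below selects the empty spans with DOMXPath instead of a recursive walk. The function name `unwrapEmptySpans` and the temporary `<div>` wrapper are assumptions made for this illustration, not part of the answer above; the wrapper is there so that several top-level blocks parse under a single root when LIBXML_HTML_NOIMPLIED is used.

```php
<?php
// Hypothetical helper (name chosen for this sketch): removes every <span>
// that has no style attribute or an empty one, but keeps its children.
function unwrapEmptySpans(string $html): string
{
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: wrap the fragment in one <div> so multiple top-level
    // blocks share a single root element.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Same condition as the loop above: no style attribute, or style="".
    $spans = $xpath->query('//span[not(@style) or @style=""]');

    // Work on a plain array copy so the tree can be modified while looping.
    foreach (iterator_to_array($spans) as $span) {
        // Move each child in front of the span, preserving order...
        while ($span->firstChild) {
            $span->parentNode->insertBefore($span->firstChild, $span);
        }
        // ...then drop the now-empty span.
        $span->parentNode->removeChild($span);
    }

    // Serialize only the wrapper's children, so the <div> is not returned.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$input = '<span style="font-size: 13px;"><span style=""><span style="font-family: Roboto, sans-serif;">'
       . '<span style="">Some text content</span></span></span></span>'
       . '<span style="">Another block</span>';

echo unwrapEmptySpans($input), PHP_EOL;
```

Like the answer's version, this leaves the whitespace text nodes of the removed spans in place, so indented input will still produce blank lines in the output.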