php, regex, simple-html-dom

Regex for removing consecutive character formatting tags


I need a regex to match and remove consecutive character-formatting tags that enclose the entire content of a paragraph tag, for use with the Simple HTML DOM Parser.

Input:

<p><b><i>Lorem Ipsum Content</i></b></p>

Expected output: <p>Lorem Ipsum Content</p>

In the case below, the regex should match and remove only the <b> tags, since <b> is the only tag that encloses the entire content of the paragraph.

Input: <p><b>Text <i> some more text </i>text inside </b></p>

Output: <p>Text <i> some more text </i>text inside </p>

Thanks.


Solution

  • It will look something like this:

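    foreach ($html->find('p') as $p) {
      // Strip one wrapping tag per iteration until no tag encloses the whole paragraph.
      // (\w+) captures only the tag name, so opening tags with attributes still match;
      // the s flag lets . match newlines inside the paragraph.
      while (preg_match('/^\s*<(\w+)\b[^>]*>(.*)<\/\1>\s*$/s', $p->innertext, $m)) {
        $p->innertext = $m[2];
      }
    }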
    

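    Note that the \1 in the regex is a backreference to the first capture group, so the closing tag has to match the opening one. The pattern is anchored with ^ and $, which is why only tags wrapping the entire paragraph content are removed, and the while loop strips one wrapping layer per pass. One caveat: because .* is greedy, content such as <b>a</b> text <b>b</b> also matches, even though no single tag wraps the whole paragraph.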
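
The loop above can be made a little stricter about sibling tags. Below is a minimal sketch, not a definitive implementation: stripWrappingTags() is a hypothetical helper name (it is not part of Simple HTML DOM), and it assumes reasonably well-formed markup with no ">" inside attribute values or comments. It works on a plain string, so it can be tried without the parser. Before removing a wrapper it counts opening and closing tags of the same name inside the candidate content, and leaves the string alone if the first tag closes early.

```php
<?php
// Hypothetical helper: removes formatting tags that wrap the ENTIRE string,
// one layer per pass, and refuses to strip when the leading tag closes early.
function stripWrappingTags(string $inner): string
{
    // Candidate: same tag name at the very start and the very end.
    while (preg_match('/^\s*<(\w+)\b[^>]*>(.*)<\/\1\s*>\s*$/si', $inner, $m)) {
        $name  = preg_quote($m[1], '/');
        $depth = 0;

        // Walk every same-named tag inside the candidate content.
        preg_match_all('/<(\/?)' . $name . '\b[^>]*>/i', $m[2], $tags, PREG_SET_ORDER);
        foreach ($tags as $tag) {
            $depth += ($tag[1] === '/') ? -1 : 1;
            if ($depth < 0) {
                // A closer came first, e.g. <b>a</b> text <b>b</b>:
                // the leading tag does not wrap everything, so stop.
                return $inner;
            }
        }
        if ($depth !== 0) {
            return $inner; // unbalanced same-named tags, leave untouched
        }

        $inner = $m[2]; // safe to drop this wrapping layer
    }
    return $inner;
}

// Possible usage with Simple HTML DOM, assuming $html is an already-parsed document:
// foreach ($html->find('p') as $p) {
//     $p->innertext = stripWrappingTags($p->innertext);
// }
```

With the two inputs from the question, this returns "Lorem Ipsum Content" and "Text <i> some more text </i>text inside " respectively, while <b>a</b> text <b>b</b> is returned unchanged. For anything beyond simple formatting tags, walking the child nodes with the parser is likely to be more reliable than a regex.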