Tags: php, regex, comments, preg-replace

Using regex to remove unessential comments from source


This is the expression I have come up with to remove comments from my source code. It removes all comments except browser-specific comments (the ones that start with [if).

/<\!--(?!\[if).*?-->/s

I do not 100% understand regex, but I managed to "Frankenstein" this together from various expressions I found while searching through Stack. This is how I read the breakdown of this expression, and I would love it if someone could help me understand it further.

/<\!--

This matches the start of the part I want to remove. Is the backslash there to escape the ! because it is part of the expression?

(?!\[if)

Does this say "only if it is not followed by an [if block"?

.*?

A wildcard, meaning it does not matter what comes in between?

-->

The end of the bit I want the expression to find.

/s

Does this tell the expression to keep matching even if there is whitespace inside the comment?
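One way to check this breakdown is to run the pattern against a small sample. The sketch below uses a made-up HTML snippet (not taken from my real source), and the comments give the usual PCRE reading of each piece, so treat it as an illustration rather than a full explanation.

```php
<?php
// Hypothetical sample input, invented for this illustration.
$html = <<<'HTML'
<div>
<!-- plain comment
     spanning two lines -->
<!--[if IE]><p>IE only</p><![endif]-->
<p>Kept</p>
</div>
HTML;

// /         pattern delimiter (start and end of the expression)
// <!--      literal comment opener; "!" is not special here, so "\!" is optional
// (?!\[if)  negative lookahead: the match fails if "[if" follows "<!--"
// .*?       lazy quantifier: as few characters as possible up to the first "-->"
// -->       literal comment closer
// s         "dotall" modifier: "." also matches newline characters
$pattern = '/<!--(?!\[if).*?-->/s';

$stripped = preg_replace($pattern, '', $html);
echo $stripped;
```

With this sample, the two-line plain comment is removed and the `[if IE]` block is left in place. Without the `s` modifier, `.` would stop at the line break and the two-line comment would not match.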

I don't want to use a piece of code just because it works for what I need; I want to actually understand what I am using and learn how to use it better in the future.

This expression works great, but I need help taking it one step further. In my source I have code snippets within script tags, e.g.

<script type="text/javascript">
  <!--
    // Main vBulletin Javascript Initialization
    vBulletin_init();
  //-->
</script>

Is there a way I can adapt my current expression to exclude <!-- comments within JavaScript?

An example of what I am trying to achieve can be seen HERE


Solution

  • You have reached the point where it becomes clear that regex patterns are a poor fit for processing markup and programming languages. The right tool here is an HTML parser. Example using the DOM and XPath:

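    // Parse the markup into a DOM tree.
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xp = new DOMXPath($dom);
    
    // Select comment nodes whose parent is not a <script> element and
    // whose text does not contain "[if" (conditional comments are kept).
    $comments = $xp->query('//*[not(local-name()="script")]/comment()[not(contains(.,"[if"))]');
    
    // Detach every matched comment from its parent node.
    foreach($comments as $comment) {
        $comment->parentNode->removeChild($comment);
    }
    
    // Serialize the cleaned document back to HTML.
    echo $dom->saveHTML();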
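
As a quick check, the same DOM and XPath steps can be run against the script block from the question. This is only a sketch: the function name `stripHtmlComments`, the surrounding sample page, and the `ie.css` file name are made up for illustration, and silencing libxml warnings with `libxml_use_internal_errors` is an added assumption, not part of the answer above.

```php
<?php
// Hypothetical helper wrapping the DOM/XPath approach from the answer.
function stripHtmlComments(string $html): string
{
    $dom = new DOMDocument;

    // Assumption: collect libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    // Comments outside <script> that do not contain "[if".
    $query = '//*[not(local-name()="script")]/comment()[not(contains(.,"[if"))]';
    foreach ($xp->query($query) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    return $dom->saveHTML();
}

// Made-up sample page built around the script block from the question.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<!--[if IE]><link rel="stylesheet" href="ie.css"><![endif]-->
<script type="text/javascript">
  <!--
    // Main vBulletin Javascript Initialization
    vBulletin_init();
  //-->
</script>
</head>
<body>
<!-- plain comment to remove -->
<p>Content</p>
</body>
</html>
HTML;

$clean = stripHtmlComments($html);
echo $clean;
```

The plain comment in the body is removed, while the conditional comment and the contents of the script element are left alone.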