Stripping HTML Comments With PHP But Leaving Conditionals


I'm currently using PHP and a regular expression to strip out all HTML comments from a page. The script works well... a little too well. It strips out all comments, including my conditional comments in the <head>. Here's what I've got:

<?php
  function callback($buffer)
  {
        return preg_replace('/<!--(.|\s)*?-->/', '', $buffer);
  }

  ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>

My regex skills aren't great, so I'm having trouble working out how to modify the pattern to exclude conditional comments such as:

<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->

<!--[if IE 7]>
<link rel="stylesheet" href="/css/ie7.css" type="text/css" media="screen" />
<![endif]-->

<!--[if IE 6]>
<link rel="stylesheet" href="/css/ie6.css" type="text/css" media="screen" />
<![endif]-->

Cheers


Solution

  • Since comments cannot be nested in HTML, a regex can do the job in theory. Still, a parser would be the better choice, especially if your input is not guaranteed to be well-formed.

    Here is my attempt. The pattern below matches only normal (non-conditional) comments. It has become quite a monster, sorry for that. I have tested it fairly extensively and it seems to work well, but I give no warranty.

    <!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->
    

    Explanation:

    <!--                #01: "<!--"
    (?!                 #02: look-ahead: a position not followed by:
      \s*               #03:   any amount of whitespace
      (?:               #04:   non-capturing group, any of:
        \[if [^\]]+]    #05:     "[if ...]"
        |<!             #06:     or "<!"
        |>              #07:     or ">"
      )                 #08:   end non-capturing group
    )                   #09: end look-ahead
    (?:                 #10: non-capturing group:
      (?!-->)           #11:   a position not followed by "-->"
      .                 #12:   eat the following char, it's part of the comment
    )*                  #13: end non-capturing group, repeat
    -->                 #14: "-->"
    

    Steps #02 and #11 are crucial. #02 makes sure that the following characters do not indicate a conditional comment. After that, #11 makes sure that the following characters do not indicate the end of the comment, while #12 and #13 cause the actual matching.

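    Apply with the "global" and "dotall" flags. In PHP, preg_replace() already replaces every match, so only the s (dotall) modifier needs to be added.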
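
    As a rough sketch of how this could be wired into PHP: the pattern is the one above, while the function name strip_plain_comments and the sample markup are made up for illustration. The helper adds the s modifier and falls back to the original buffer if preg_replace() reports an error.

```php
<?php
// Hypothetical helper name; the pattern itself is the one explained above.
function strip_plain_comments($buffer)
{
    // "s" = dotall, so "." also matches newlines inside multi-line comments.
    $pattern = '/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/s';

    $result = preg_replace($pattern, '', $buffer);

    // preg_replace() returns null on failure (for example a backtrack limit);
    // fall back to the untouched buffer rather than emitting an empty page.
    return $result === null ? $buffer : $result;
}

// Sample markup, assumed for illustration only.
$html = <<<'HTML'
<!-- plain comment -->
<!--[if IE 7]>
<link rel="stylesheet" href="/css/ie7.css" />
<![endif]-->
<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" />
<!-- <![endif]-->
<p>Hello</p><!-- multi
line -->
HTML;

echo strip_plain_comments($html);
```

    Because the helper takes the buffer and returns a string, its name could be passed to ob_start() in place of the callback from the question.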

    To do the opposite (match only conditional comments), it would be something like this:

    <!(--)?(?=\[)(?:(?!<!\[endif\]\1>).)*<!\[endif\]\1>
    

    Explanation:

    <!                  #01: "<!"
    (--)?               #02: two dashes, optional
    (?=\[)              #03: a position followed by "["
    (?:                 #04: non-capturing group:
      (?!               #05:   a position not followed by
        <!\[endif\]\1>  #06:     "<![endif]>" or "<![endif]-->" (depends on #02)
      )                 #07:   end of look-ahead
      .                 #08:   eat the following char, it's part of the comment
    )*                  #09: end of non-capturing group, repeat
    <!\[endif\]\1>      #10: "<![endif]>" or "<![endif]-->" (depends on #02)
    

    Again, apply with "global" and "dotall" flags.
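
    A sketch of the same idea in PHP, with one deliberate change. As far as I know, PCRE fails a backreference to a group that did not participate in the match, so "(--)?" followed by "\1" would never match the dash-less "<![if ...]>" form there. The version below writes step #02 as "(--|)" so the group always captures, possibly an empty string. The function name find_conditional_comments and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper name. Pattern adapted from the one above: "(--|)"
// replaces "(--)?" so that group 1 always participates and "\1" can be empty.
function find_conditional_comments($html)
{
    $pattern = '/<!(--|)(?=\[)(?:(?!<!\[endif\]\1>).)*<!\[endif\]\1>/s';

    // preg_match_all() collects every match, which covers the "global" flag.
    if (preg_match_all($pattern, $html, $matches) === false) {
        return array();
    }

    // Index 0 holds the full text of each match.
    return $matches[0];
}

// Sample markup, assumed for illustration only.
$sample = <<<'HTML'
<!-- plain comment -->
<!--[if IE 7]>
<link rel="stylesheet" href="/css/ie7.css" />
<![endif]-->
<![if !IE]>
<link rel="stylesheet" href="/css/screen.css" />
<![endif]>
HTML;

print_r(find_conditional_comments($sample));
```

    The same pattern could be handed to preg_replace() instead if the goal is to remove conditional comments rather than list them.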

    Step #02 is there because of the "downlevel-revealed" syntax; see "MSDN - About Conditional Comments".

    I'm not entirely sure where spaces are allowed or expected. Add \s* to the expression where appropriate.