How to remove all HTML conditional comments using regular expressions (lex & yacc)?
I want to remove all of these comments and leave only the last HTML tag.
I have tried the regex "<!"(.*?)--> to match the conditional comments, but it didn't work. I am looking for a regex that matches these conditional comments.
Here is the HTML code. I am trying to delete all the comments and leave only the last HTML tag.
<!--[if lte IE 7]>
<html class="ie7 oldie" xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<![endif]-->
<!--[if IE 8]>
<html class="ie8 oldie" xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<![endif]-->
<!--[if gt IE 8]><!-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<!--<![endif]-->
Here are two important facts about (f)lex regular expressions. (See the flex manual for complete documentation of Flex patterns. The section is not very long.)
In (f)lex, the . wildcard matches anything except a newline character. In other words, it is equivalent to [^\n]. So "<!".* will only match to the end of the line. You could fix that by using (.|\n) instead, but see below.
(F)lex does not provide non-greedy repetition (*?). All repetitions extend to the longest possible match. (.*?)--> will therefore match up to the last --> on the line, and (.|\n)*?--> would match up to the last --> in the file.
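Both facts can be checked on a small input. The sketch below uses Python's re module purely as a stand-in, because (f)lex patterns cannot be run on their own. Python does support *?, so the example sticks to the greedy forms, which for these particular patterns end at the same place as the (f)lex longest match. The html string is a made-up test input.

```python
import re

# Made-up input: two comments on the first line, one on the second.
html = "<!-- a --> keep <!-- b -->\n<p>second line</p> <!-- c -->\n"

# "." stops at a newline, so this match cannot leave the first line;
# the greedy ".*" then runs to the LAST "-->" on that line.
one_line = re.search(r"<!.*-->", html).group(0)

# Allowing newlines with (.|\n) lets the same greedy repetition
# run to the last "-->" in the whole input.
whole_input = re.search(r"<!(?:.|\n)*-->", html).group(0)

print(repr(one_line))
print(repr(whole_input))
```

In both cases the text between the comments ("keep" and the paragraph) is swallowed by the match, which is exactly the problem with these patterns.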
It is possible to write a regular expression which does what you want, although it's a bit messy:
<!--([^-]|-[^-]|--+[^->])*--+>
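should work, as long as the input text does not end with an unterminated comment. (The quotes in your pattern are mostly unnecessary, since almost none of the quoted characters has any special meaning to (f)lex, but they don't hurt. One caveat: as far as I know, a < at the very start of a rule is read as the beginning of a start-condition prefix, so in an actual rule the leading <!-- is safer written as "<!--". I left the quotes out above because I don't think they make the pattern any more readable.)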
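As a quick sanity check, the pattern can be tried on the HTML from the question. This is only a sketch in Python's re module rather than (f)lex. The only change to the pattern is the non-capturing group (?:...), and the names COMMENT, sample, and remaining are illustrative. Since the pattern cannot run past the first comment terminator, Python's backtracking matcher and the (f)lex longest-match rule should agree here.

```python
import re

# The pattern from the answer, written with a Python non-capturing group.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-]|--+[^->])*--+>")

# The HTML from the question.
sample = """<!--[if lte IE 7]>
<html class="ie7 oldie" xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<![endif]-->
<!--[if IE 8]>
<html class="ie8 oldie" xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<![endif]-->
<!--[if gt IE 8]><!-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr" xml:lang="fr">
<!--<![endif]-->
"""

# Delete every comment, then keep only the lines that still have content.
stripped = COMMENT.sub("", sample)
remaining = [line for line in stripped.splitlines() if line.strip()]
print(remaining)
```

The first two conditional blocks are each a single comment, so their html tags disappear with them. The third block is two short comments around an uncommented tag, which is why only that tag survives.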
The repeated sequence matches any of:
[^-] : any character other than a dash; or
-[^-] : a dash followed by something other than another dash; or
--+[^->] : two or more dashes followed by something other than a dash or >.
The last alternative in the repetition might require some explanation. The goal is to avoid mishandling inputs like
<!-- Comment with too many dashes --->
If we'd just written the tempting --[^>] as the third alternative, ---> would not be recognised as terminating the pattern, since --- would match --[^>] (a dash is not a right angle bracket) and > would then match [^-], and the scan would continue. Adding the + to match a longer sequence of dashes is not enough, because, like many regex engines, (f)lex is looking for the longest overall match, not the longest submatch in each set of alternatives: --+[^>] can still match --- by using only two of the dashes for --+. So we need to write --+[^->], which cannot match ---.
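The three candidates for the third alternative can be compared side by side. This is a hedged sketch using Python's re module as a stand-in for (f)lex. The helper first_match and the test string are made up for illustration. For this input, Python's greedy backtracking ends each match at the same place the (f)lex longest match would.

```python
import re

# Made-up input: a comment closed by "--->", ordinary text, then a second comment.
text = "<!-- too many dashes ---> keep me <!-- second -->"

def first_match(third_alternative):
    """Build the comment pattern around the given third alternative
    and return what it matches at the start of `text`."""
    pattern = r"<!--(?:[^-]|-[^-]|" + third_alternative + r")*--+>"
    return re.match(pattern, text).group(0)

naive = first_match(r"--[^>]")       # "---" matches it, so "--->" is passed over
plus_only = first_match(r"--+[^>]")  # "--+" can give back a dash: same problem
fixed = first_match(r"--+[^->]")     # cannot match "---", so "--->" ends the comment

for name, matched in (("naive", naive), ("plus_only", plus_only), ("fixed", fixed)):
    print(name, repr(matched))
```

With the first two variants the match runs through "keep me" and only stops at the end of the second comment. Only the last variant stops at the first --->.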
If that was not clear (and I can see why it wouldn't be), you could instead use a start condition to write a much simpler set of patterns:
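%x COMMENT
%%
"<!--"       { BEGIN(COMMENT); /* start of a comment */ }
<COMMENT>{
  "-->"      { BEGIN(INITIAL); /* end of the comment */ }
  [^-]+      ; /* skip a run of non-dash characters */
  .|\n       ; /* skip any other single character */
}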
The second <COMMENT> rule is really just an efficiency hack; it avoids triggering a no-op action on every character. With the second rule in place, the last rule really can only match a single -, so it could have been written that way. But writing it in full allows you to remove the second rule and demonstrate to yourself that it works without it.
The key insight for matching the comment in pieces like this is that (f)lex always chooses the longest match, which is in some ways similar to the goal of non-greedy matches. While inside the <COMMENT> start condition, - will only match the single character fallback rule if it cannot be part of the match of -->, which is longer.
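To see the longest-match rule at work without running flex, the start-condition scanner can be modelled in a few lines of Python. This is a rough sketch and not how (f)lex is implemented. The functions make_rules and strip_comments, the "ECHO" marker, and the explicit INITIAL fallback rule (standing in for the default rule that echoes unmatched text) are all assumptions made for illustration.

```python
import re

def make_rules(with_run_rule=True):
    """Rules per state as ordered (pattern, action) pairs. An action is a
    state name to switch to, "ECHO" to copy the text, or None to skip it."""
    comment = [(r"-->", "INITIAL")]
    if with_run_rule:
        comment.append((r"[^-]+", None))  # efficiency rule: skip a whole run
    comment.append((r".|\n", None))       # single-character fallback
    return {
        "INITIAL": [(r"<!--", "COMMENT"), (r".|\n", "ECHO")],
        "COMMENT": comment,
    }

def strip_comments(text, with_run_rule=True):
    rules = {state: [(re.compile(p), a) for p, a in state_rules]
             for state, state_rules in make_rules(with_run_rule).items()}
    state, pos, out = "INITIAL", 0, []
    while pos < len(text):
        # The longest match wins; on a tie the earlier rule wins.
        best_len, best_action = 0, None
        for regex, action in rules[state]:
            m = regex.match(text, pos)
            if m and m.end() - pos > best_len:
                best_len, best_action = m.end() - pos, action
        if best_action == "ECHO":
            out.append(text[pos:pos + best_len])
        elif best_action is not None:
            state = best_action
        pos += best_len  # every state has a one-character fallback, so this advances
    return "".join(out)

sample = "a<!-- x --->b<!--[if IE 8]>\n<html>\n<![endif]-->c"
print(strip_comments(sample))
print(strip_comments(sample, with_run_rule=False))
```

Both calls print the same text, which matches the claim that the run rule only saves work. Inside the comment, a dash falls through to the one-character rule unless the three-character terminator matches at that position.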