I'm writing a Java lexer in Python using PLY.
I have this finite state machine:
It is meant to match every line comment in a piece of code. I want to build a Python regex that does exactly what this machine does.
The regex will go in a PLY rule function called t_IGNORE_LINECOMMENT(t), so that every line comment is ignored while lexing.
All the similar regexes that I found have some issues, like this one:
(\/\/[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]|\/\*([^*]|\*(?!\/))*?\*\/)(?=[^"]*(?:"[^"]*"[^"]*)*$)
that can be tested here.
This one is supposed to match every kind of comment, but it can also match "//"/", and it fails on hey = "//comment" //comment ", where it matches all of //comment" //comment " as a comment rather than only //comment.
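The second failure can be reproduced with a short script. This is only a sketch: the names BROAD_COMMENT, line and matched_text are made up for illustration, and the pattern is the one quoted above, split across three string literals for readability.

```python
import re

# The broad comment regex quoted above, unchanged apart from being
# split over three adjacent raw strings.
BROAD_COMMENT = re.compile(
    r'(\/\/[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]'
    r'|\/\*([^*]|\*(?!\/))*?\*\/)'
    r'(?=[^"]*(?:"[^"]*"[^"]*)*$)'
)

# A string literal containing "//", followed by a real line comment.
line = 'hey = "//comment" //comment "\n'

match = BROAD_COMMENT.search(line)
matched_text = match.group(0)

# The match starts at the "//" inside the string literal, so it swallows
# the rest of the literal together with the real comment.
print(repr(matched_text))
```

The match begins inside the string literal because nothing in the pattern records that a quote was opened earlier on the line.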
In the finite state machine, A stands for the whole alphabet, and A/{x,y} stands for the whole alphabet except x and y.
(?:[^"]|"(?:[^\\"]|\\.)*")*?(//.*?[\r\n])
should do what you want (given re.DOTALL): it matches as few non-string characters or strings as possible (each string being any number of characters that are neither a quote nor a backslash, or escape sequences), followed by // and as few characters as possible up to the next EOL character. (The first non-greedy repetition is necessary to cause the comment to be as long as possible.)
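A minimal sketch of how this pattern could be exercised with the standard re module follows. The names LINE_COMMENT, line_comments and sample are hypothetical, chosen only for this example. It assumes the pattern is applied with match() at the current scan position, so the prefix part consumes the code and strings before the comment, and that the comment itself is read from group 1.

```python
import re

# Prefix: code characters and whole string literals, as few as possible.
# Group 1: the line comment, from // up to and including the EOL character.
LINE_COMMENT = re.compile(
    r'(?:[^"]|"(?:[^\\"]|\\.)*")*?(//.*?[\r\n])',
    re.DOTALL,
)


def line_comments(source):
    """Collect every line comment in source, scanning left to right."""
    comments = []
    pos = 0
    while True:
        # match() anchors at pos, so a // inside a string literal is only
        # ever seen as part of the string alternative in the prefix.
        m = LINE_COMMENT.match(source, pos)
        if m is None:
            break
        comments.append(m.group(1))
        pos = m.end()  # resume right after the comment's EOL character
    return comments


sample = 'hey = "//comment" //comment "\nint x = 1; // second\n'
print(line_comments(sample))
# prints ['//comment "\n', '// second\n']
```

Anchoring matters here: with search() the engine would be free to start the match in the middle of a string literal, and a // inside the literal could then be reported as a comment.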