python regex ply

Writing a Python regex given this finite state machine


I'm writing a Java lexer in Python using PLY.

I have this finite state machine:

[Figure: regular expression matching state machine]

It is meant to match every line comment in a piece of code. I want to build a Python regex that does exactly what this machine does.

The regex will go in a function called t_IGNORE_LINECOMMENT(t), so that every line comment is ignored while lexing.

All the similar regexes I have found have issues, like this one:

(\/\/[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n]|\/\*([^*]|\*(?!\/))*?\*\/)(?=[^"]*(?:"[^"]*"[^"]*)*$)

that can be tested here.

This one is supposed to match every kind of comment, but it also matches "//"/" and it mishandles hey = "//comment" //comment ", where it matches all of //comment" //comment " as a comment instead of only //comment

In the finite state machine, A denotes the whole alphabet, and A/{x,y} denotes the alphabet without x and y.


Solution

  • (?:[^"]|"(?:[^\\"]|\\.)*")*?(//.*?[\r\n])
    

    should do what you want (given re.DOTALL). It lazily matches non-quote characters or whole string literals (each a run of non-quote, non-backslash characters or backslash escapes), then //, then as few characters as possible up to the next EOL character. (The first repetition must be non-greedy so that the comment starts at the first // outside a string, which makes the comment as long as possible.)
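A minimal sketch of how this pattern behaves with Python's re module, under a few assumptions. The string-body character class is written as [^\\"] so that it excludes both the backslash and the quote, as the description of "non-quote non-escapes" implies. The helper name first_line_comment is hypothetical and only for illustration. The sketch anchors with match() rather than search(), because the pattern assumes that matching begins outside any string literal; a search() could start inside a string and report a false comment.

```python
import re

# Pattern from the answer. The string-body class is written as [^\\"]
# (neither a backslash nor a quote), which is an assumption based on the
# prose description of the pattern.
LINE_COMMENT = re.compile(
    r'(?:[^"]|"(?:[^\\"]|\\.)*")*?(//.*?[\r\n])',
    re.DOTALL,
)


def first_line_comment(source):
    """Return the first line comment in `source`, or None.

    Hypothetical helper for illustration. It assumes `source` starts
    outside any string literal, so it anchors with match(), not search().
    """
    m = LINE_COMMENT.match(source)
    return m.group(1) if m else None


# The "//" inside the string literal is skipped; the trailing comment is found.
print(repr(first_line_comment('hey = "//comment" //comment \n')))  # '//comment \n'

# A "//" that only occurs inside a string is not reported.
print(repr(first_line_comment('s = "//";\n')))  # None

# An escaped quote does not end the string early.
print(repr(first_line_comment('p = "a\\"//b"; // real\n')))  # '// real\n'
```

The comment text is in group 1, including its terminating EOL character; the leading part of the match is the code that precedes the comment.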