I'm writing a parser using Python/lex and trying to create an entry to remove C-style comments. My current (faulty) attempt is:
    def t_comment_ignore(t):
        r'(\/\*[^*]*\*\/)|(//[^\n]*)'
        pass
This produced a quirk that baffled me. When I parse the string below:
input = """
if // else mystery
=/*=*/=
true /* false
*/ return"""
The output tokens are:
['IF', 'EQUAL', 'TIMES', 'EQUAL', 'DIVIDE', 'EQUAL', 'TRUE', 'RETURN']
Apparently the comment on line 3 wasn't recognized properly and 3 of the symbols therein were returned as tokens.
But if I add a space before the comment in line 3, i.e.:
input = """
if // else mystery
= /*=*/=
true /* false
*/ return"""
I get:
['IF', 'EQUAL', 'EQUAL', 'TRUE', 'RETURN']
Debugging showed that all 3 comments were recognized correctly when the extra space was added.
Well, I'm utterly baffled by this behavior. Any input is appreciated.
Thanks, Paulo
PS: As some probably noticed, this enchilada is from Problem Set 2 in https://www.udacity.com/wiki/cs262. They give a more elaborate solution using another of lex's features, but I'm wondering if my approach is sound and if my code is fixable.
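My guess is that your pattern for EQUAL matches r'=.' instead of (or as well as) r'='. In that case the lexer consumes "=/" as a single EQUAL token, so the "/*" that opens the comment is never seen, and the remaining "*", "=*", "/" and "=" come out as TIMES, EQUAL, DIVIDE and EQUAL. With a space after the first "=", the EQUAL rule swallows the space instead and the comment is matched normally.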
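To check that guess without the real lexer, here is a minimal sketch using only the standard `re` module rather than lex itself. Everything in it is an assumption made for illustration: the `tokenize` helper, the token table, and in particular the EQUAL pattern `r'=.|='`, since the question does not show the actual rule.

```python
import re

def tokenize(text, equal_pattern):
    """Tiny regex tokenizer (hypothetical stand-in for the real lexer).

    Alternatives are tried in order at each position; the first that matches wins.
    """
    spec = [
        ('COMMENT', r'/\*[^*]*\*/|//[^\n]*'),  # same language as the question's rule
        ('EQUAL', equal_pattern),              # the rule under suspicion
        ('TIMES', r'\*'),
        ('DIVIDE', r'/'),
        ('WORD', r'[a-z]+'),                   # keywords such as if, true, return
        ('SKIP', r'\s+'),
    ]
    master = re.compile('|'.join(f'(?P<{name}>{rx})' for name, rx in spec))
    tokens = []
    for m in master.finditer(text):
        kind = m.lastgroup
        if kind in ('COMMENT', 'SKIP'):
            continue  # comments and whitespace produce no token
        tokens.append(m.group().upper() if kind == 'WORD' else kind)
    return tokens

text = """
if // else mystery
=/*=*/=
true /* false
*/ return"""

suspect = tokenize(text, r'=.|=')                           # assumed faulty EQUAL rule
fixed = tokenize(text, r'=')                                # EQUAL matches '=' only
spaced = tokenize(text.replace('=/*', '= /*'), r'=.|=')     # the "extra space" variant
print(suspect)
print(fixed)
print(spaced)
```

Under these assumptions, `suspect` reproduces the eight tokens reported in the question, while `fixed` and `spaced` both give the five expected ones, which is consistent with the EQUAL rule, not the comment rule, being the culprit.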
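By the way, the correct comment pattern is r'/[*][^*]*[*]+([^/*][^*]*[*]+)*/|//[^\n]*'. Unlike the pattern in the question, it also matches comments whose body contains a '*', such as /* a * b */ or /***/.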
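A quick way to compare the two comment patterns is to run both over a string with stars inside the comment bodies. This is only a sketch with the standard `re` module; the names `NAIVE`, `ROBUST`, `strip_comments`, and the sample string are made up for the example.

```python
import re

# Pattern from the question: [^*]* cannot step over a '*' inside the comment body.
NAIVE = re.compile(r'(\/\*[^*]*\*\/)|(//[^\n]*)')
# Pattern suggested in this answer.
ROBUST = re.compile(r'/[*][^*]*[*]+([^/*][^*]*[*]+)*/|//[^\n]*')

def strip_comments(pattern, text):
    """Remove every match of `pattern` from `text` (hypothetical helper)."""
    return pattern.sub('', text)

sample = "a /* x * y */ b // tail\nc /***/ d"
naive_out = strip_comments(NAIVE, sample)
robust_out = strip_comments(ROBUST, sample)
print(repr(naive_out))
print(repr(robust_out))
```

With this sample, the question's pattern removes only the `//` comment and leaves both block comments in place, whereas the longer pattern removes all three.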