Removing comments using lex: why doesn't this work?


I'm writing a parser using Python lex (PLY) and trying to write a rule that discards C-style comments. My current (faulty) attempt is:

def t_comment_ignore(t): 
    r'(\/\*[^*]*\*\/)|(//[^\n]*)'
    pass

This produced a quirk that baffled me. When I parse the string below:

input = """
if // else mystery  
=/*=*/= 
true /* false 
*/ return"""

The output tokens are:

['IF', 'EQUAL', 'TIMES', 'EQUAL', 'DIVIDE', 'EQUAL', 'TRUE', 'RETURN']

Apparently the comment on line 3 wasn't recognized properly and 3 of the symbols therein were returned as tokens.

But if I add a space before the comment in line 3, i.e.:

input = """
if // else mystery  
= /*=*/= 
true /* false 
*/ return"""

I get:

['IF', 'EQUAL', 'EQUAL', 'TRUE', 'RETURN']

Debugging showed that all 3 comments were recognized correctly when the extra space was added.

Well, I'm utterly baffled by this behavior. Any input is appreciated.

Thanks, Paulo

PS: As some probably noticed, this enchilada is from Problem Set 2 in https://www.udacity.com/wiki/cs262. They give a more elaborate solution using another of lex's features, but I'm wondering if my approach is sound and if my code is fixable.


Solution

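  • My guess is that your pattern for EQUAL matches =. (an equals sign followed by any single character) rather than, or in addition to, a lone equals sign. If so, the = at the start of line 3 swallows the / that should open the comment, so the comment rule never gets to see /*. With the extra space, the = swallows the space instead and the comment is matched normally.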

    By the way, the correct comment pattern is:

        /[*][^*]*[*]+([^/*][^*]*[*]+)*/|//[^\n]*
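
Both points above can be checked without the original lexer. The sketch below is only a stand-in written with Python's re module, not PLY itself and not the asker's actual rules: the rule list, the tokenize helper, and the two sample strings are assumptions made for illustration. It tries the rules in a fixed order at each position and takes the first one that matches, which is a simplification of how PLY orders its rules. Under the hypothesis that EQUAL is defined as =. it reproduces the token list from the question, and it also shows that the question's comment pattern rejects a block comment containing a * while the pattern from the answer accepts it.

```python
import re

# Comment pattern from the question and the one suggested in the answer.
QUESTION_COMMENT = r'(\/\*[^*]*\*\/)|(//[^\n]*)'
ANSWER_COMMENT = r'/[*][^*]*[*]+([^/*][^*]*[*]+)*/|//[^\n]*'

# The two inputs from the question, written with explicit \n so the
# trailing spaces that matter here are visible.
SAMPLE = "\nif // else mystery  \n=/*=*/= \ntrue /* false \n*/ return"
SAMPLE_SPACED = "\nif // else mystery  \n= /*=*/= \ntrue /* false \n*/ return"


def tokenize(text, equal_pattern, comment_pattern=QUESTION_COMMENT):
    """Return token names; the first rule that matches at a position wins.

    The rule set is hypothetical: only the token names come from the
    question, the patterns (apart from the comment rule) are guesses.
    """
    rules = [
        ('COMMENT', comment_pattern),
        ('IF', r'if\b'),
        ('TRUE', r'true\b'),
        ('RETURN', r'return\b'),
        ('EQUAL', equal_pattern),
        ('TIMES', r'\*'),
        ('DIVIDE', r'/'),
        ('SKIP', r'[ \t\n]+'),
    ]
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        for name, regex in compiled:
            match = regex.match(text, pos)
            if match:
                break
        else:
            raise ValueError('no rule matches at position %d' % pos)
        if name not in ('COMMENT', 'SKIP'):  # comments and whitespace are dropped
            tokens.append(name)
        pos = match.end()
    return tokens


if __name__ == '__main__':
    # Hypothesised faulty rule: '=.' eats the '/' that should open the comment.
    print(tokenize(SAMPLE, r'=.'))
    # ['IF', 'EQUAL', 'TIMES', 'EQUAL', 'DIVIDE', 'EQUAL', 'TRUE', 'RETURN']

    # With the extra space, '=.' eats the space and the comment survives.
    print(tokenize(SAMPLE_SPACED, r'=.'))
    # ['IF', 'EQUAL', 'EQUAL', 'TRUE', 'RETURN']

    # With a plain '=' rule the original input lexes as expected.
    print(tokenize(SAMPLE, r'='))
    # ['IF', 'EQUAL', 'EQUAL', 'TRUE', 'RETURN']

    # The question's pattern cannot match a comment whose body contains '*'.
    print(re.fullmatch(QUESTION_COMMENT, '/* a * b */'))            # None
    print(re.fullmatch(ANSWER_COMMENT, '/* a * b */') is not None)  # True
```

If the real EQUAL rule turns out to be a plain =, this sketch no longer explains the output and the cause has to be looked for elsewhere in the rule set.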