Tags: python, lex, python-re

Regular expression matching with re but not lex


I am trying to parse a file in order to reformat it. For this, I need to be able to distinguish between full line comments and end of line comments. I have been able to get lex to recognize full line comments properly, but am having issues with end of line comments.

For example, the comment in "a = 0; //This; works; fine" is handled without complaint, but the one in "a = 0; //This, does; not;" is not.

What confuses me the most is that re is able to recognise both comments without issue, and yet lex cannot.

Here is the relevant code (FL=full line, EL=end of line):

import ply.lex as lex

tokens = (
    'EQUAL',
    'SEMICOLON',
    'FL_COMMENT',
    'EL_COMMENT',
    'STRING'
)
t_EQUAL = r'='
t_SEMICOLON = r';'
def t_FL_COMMENT(t):
    r"""(^|\n)\s*(//|\#).*"""
    return t
def t_EL_COMMENT(t):
    r"""(?<=;)\s*(//|\#).*"""
    return t
def t_STRING(t):
    r"""(".*")|([a-zA-Z0-9\</][\w.\-\+/]*)"""
    return t
def t_newline(t):
    r"""\n"""
    t.lexer.lineno += len(t.value)
t_ignore = ' \t'
def t_error(t):
    print("Illegal character '%s' on line %d" % (t.value[0], t.lineno))
    t.lexer.skip(1)
def t_eof(t):
    return None

lexer = lex.lex()

lexer.input(file_contents)

for token in lexer:
    print(token)

Solution

  • Lex (including the Ply variety) builds lexical analysers, not regular expression searchers. A regular expression library generally scans through the input looking for a place where the pattern matches. Lex instead decides which pattern matches at the current input point, advances the input to the point immediately following that match, and then repeats the process from there. Every character in the text therefore ends up inside some matched token (although some tokens might be discarded).
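
To make the difference concrete, here is a small sketch using only the standard re module, not Ply itself. It assumes the lexer has already skipped the space after the semicolon (which is what t_ignore is for), so the pattern is tested starting exactly at the first slash. The variable names are just for illustration.

```python
import re

# The EL_COMMENT pattern from the question, lookbehind included.
EL_COMMENT = re.compile(r"(?<=;)\s*(//|\#).*")

line = "a = 0; //This, does; not;"

# A searcher may start the match anywhere, so it begins right after the ';'
# and lets \s* consume the space.
found = EL_COMMENT.search(line)
print(repr(found.group(0)))

# A lexer asks a narrower question: does the pattern match starting exactly
# here? At the first '/', the preceding character is a space, not ';',
# so the lookbehind fails.
pos = line.index("//")
anchored = EL_COMMENT.match(line, pos)
print(anchored)
```

The search succeeds and the anchored match does not, which is one plausible explanation for re and lex disagreeing on the same pattern.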

    You can actually take advantage of this fact to simplify your regular expressions. In this case, for example, since you can count on t_FL_COMMENT to match any comment which does occur at the beginning of a line, any other comment must not be at the start of a line. So no lookbehind is needed:

    def t_FL_COMMENT(t):
        r"""(^|\n)\s*(//|\#).*"""
        return t
    
    def t_EL_COMMENT(t):
        r"""(//|\#).*"""
        return t
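
If you want to see why rule order alone is enough, the following is a rough stand-in for a generated lexer, written with plain re rather than Ply. The RULES table, the WORD rule, and the tokenize helper are hypothetical simplifications for illustration; only the two comment patterns come from the rules above.

```python
import re

# Rules are tried in order at the current position; the first match wins.
# This imitates the matching loop of a lexer, it is not Ply's implementation.
RULES = [
    ("FL_COMMENT", re.compile(r"(^|\n)\s*(//|\#).*")),
    ("EL_COMMENT", re.compile(r"(//|\#).*")),
    ("EQUAL", re.compile(r"=")),
    ("SEMICOLON", re.compile(r";")),
    ("WORD", re.compile(r"\w+")),   # simplified stand-in for t_STRING
    ("NEWLINE", re.compile(r"\n")),
]
IGNORE = " \t"


def tokenize(text):
    """Return (name, value) pairs, matching only at the current position."""
    pos = 0
    tokens = []
    while pos < len(text):
        if text[pos] in IGNORE:
            pos += 1
            continue
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                if name != "NEWLINE":  # newlines are matched but discarded
                    tokens.append((name, m.group(0).strip()))
                pos = m.end()
                break
        else:
            raise ValueError(f"illegal character {text[pos]!r} at {pos}")
    return tokens


sample = "// header\na = 0; //This, does; not;\n  # indented\n"
for tok in tokenize(sample):
    print(tok)
```

Because FL_COMMENT is tried first and can only succeed at the start of the input or at a newline, anything that reaches EL_COMMENT is by construction a trailing comment.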
    

    An alternative to (^|\n) is (?m)^, which enables multiline mode so that ^ matches right after a newline as well as at the beginning of the string.
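
A quick way to check the multiline behaviour is with plain re, outside of any lexer. The sample text below is made up for the demonstration.

```python
import re

text = "// first\na = 0; // trailing\n  # second\n"

# Inline form: (?m) at the very start of the pattern turns on multiline mode.
inline = re.compile(r"(?m)^\s*(//|\#).*")

# Equivalent form: pass the flag to compile() instead of embedding it.
flagged = re.compile(r"^\s*(//|\#).*", re.MULTILINE)

inline_hits = [m.group(0) for m in inline.finditer(text)]
flagged_hits = [m.group(0) for m in flagged.finditer(text)]

# Only the two full-line comments are found; the trailing one is skipped,
# and the newline is no longer part of the matched text.
print(inline_hits)
print(flagged_hits)
```

One caveat, stated as an assumption about how your lexer generator assembles its patterns: if the individual rules are joined into one larger expression, an inline (?m) may no longer sit at the very start, and Python 3.11 and later reject global inline flags anywhere else. Supplying re.MULTILINE as a compile flag, where the tool allows it, avoids that problem.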