
PLY - escaping new line in C-style comments


I'm writing a simple parser using PLY. My comments can look like this:

# this is a single line comment \
with an escaped new line

My attempt uses lexer states. I have:

states = (
    ('COMMENT', 'exclusive'),
)
tokens = ('COMMENT',)

def t_begin_COMMENT(t):
    r'\#'
    t.lexer.begin('COMMENT')


def t_COMMENT_contents(t):
    r'.|\\\n'


t_COMMENT_ignore = r' '

def t_COMMENT_error(t):
    pass


def t_COMMENT_end(t):
    r'\n'
    t.lexer.begin('INITIAL')

When I do

lexer = lex.lex()
string = "# test \\\ns \n4"
lexer.input(string)
for tok in lexer:
    print(tok)

it should print only 4 (I have another token rule for that, but it's irrelevant here), but I get both s and 4, even though s should still be part of the comment. How do I write the regex for the comment contents? Is this because COMMENT ends with \n?


Solution

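  • Python regular expressions do not produce the longest match. Alternation (|) in a Python regular expression is ordered: the first alternative that matches wins. With the pattern .|\\\n, the . matches any character except a newline, including a backslash, so the \\\n alternative can never succeed. The backslash is consumed on its own, and the newline after it is then matched by your end rule, which closes the comment. This is easier to see without the escape symbols: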

    >>> import re
    >>> re.match(r'.|ab', 'ab')
    <_sre.SRE_Match object; span=(0, 1), match='a'>
    >>> re.match(r'ab|.', 'ab')
    <_sre.SRE_Match object; span=(0, 2), match='ab'>
    

    It's not at all clear to me why you want to go to all that work with lexer states, rather than using a single regular expression:

    def t_comment(t):
        r'\#(\\\n|.)*\n'
        pass
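
Both points can be checked outside PLY with the plain re module. The snippet below is only a sketch: it reuses the input string from the question and passes re.VERBOSE on the assumption that this mirrors how PLY compiles rule patterns by default. The variable names are made up for the illustration.

```python
import re

# Input string from the question: a comment with an escaped newline, then "4".
text = "# test \\\ns \n4"

# Index of the backslash that escapes the first newline.
i = text.index("\\")

# "." is tried first and matches the backslash alone, so the newline that
# follows is left over to end the comment.
dot_first = re.compile(r'.|\\\n', re.VERBOSE).match(text, i)

# With the alternatives swapped, backslash + newline is consumed as one unit.
escape_first = re.compile(r'\\\n|.', re.VERBOSE).match(text, i)

print(repr(dot_first.group()))
print(repr(escape_first.group()))

# The single-pattern comment rule, tried directly on the same input.
comment = re.compile(r'\#(\\\n|.)*\n', re.VERBOSE)
m = comment.match(text)
print(repr(m.group()))        # the whole comment, including the continuation line
print(repr(text[m.end():]))   # what is left for the other rules
```

With the single pattern, the continuation line "s " is swallowed together with the comment, and only "4" remains for the rest of the lexer.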
    

    Note: I would prefer the regular expression r'\#(\\[\s\S]|.)*', which allows a \ to escape anything, including itself. A pattern in which \ can only escape a newline never lets you put a literal backslash at the end of a comment line:

    # This will continue, perhaps unexpectedly: \\
    still a comment
    

    Also, the trailing \n will be ignored anyway, so there's no obvious reason to include it in the pattern, where it might fail to match if the comment is right at the end of the input and the input doesn't terminate with a newline.
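
The difference between the two patterns can be sketched with the re module alone. The names strict and general, and the sample strings, are invented for this illustration; re.VERBOSE is passed on the assumption that it matches PLY's default flags.

```python
import re

# Only backslash-newline is treated as an escape; the pattern requires a final \n.
strict = re.compile(r'\#(\\\n|.)*\n', re.VERBOSE)
# A backslash escapes any character; no trailing newline is required.
general = re.compile(r'\#(\\[\s\S]|.)*', re.VERBOSE)

# A comment line that ends in an escaped backslash, followed by another line.
ends_in_backslash = "# ends with a backslash: \\\\\nnext line\n"

# strict reads the second backslash as a line continuation and keeps going.
print(repr(strict.match(ends_in_backslash).group()))
# general pairs the two backslashes and stops before the newline.
print(repr(general.match(ends_in_backslash).group()))

# A real continuation is still handled by the general pattern.
continued = "# continued \\\nhere\nx = 1"
print(repr(general.match(continued).group()))

# A comment at the very end of the input, with no final newline.
no_newline = "# last line"
print(strict.match(no_newline))               # None: the required \n is missing
print(repr(general.match(no_newline).group()))
```

Since the general pattern stops before the newline, that newline is left for whatever rule the lexer already uses for line endings.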