Search code examples
pythonparsingply

Multiline PLY parser


I'm using PLY to parse arithmetic expressions that span over multiple lines (or separated through ";"). I'm unsure whether to ignore the newline tokens as I don't really need them. If they are ignored (return None in t_NEWLINE), is it necessary to account for them in the productions?

def t_NEWLINE(self, t):
    r"""\n+|;+"""
    t.lexer.lineno += t.value.count("\n") + t.value.count(";")
    return t # or not

def p_expression(self, t):
    """ expression : ..."""

def p_expressions(self, t):
    """ expressions : expression
                    | expressions NEWLINE expression
    """
    # Do I need NEWLINE at all? How does the production figure out where expression ends without NEWLINE?

Solution

  • This seems to be more a question of language design than implementation, although of course the implementation may not be obvious even when the design is completely specified.

    There are several possible approaches to languages allowing multiple multi-line expressions, and it is not at all clear to me which one you are interested in implementing:

    1. Python style. Expressions are terminated by either ; or newline, but newlines are ignored inside of bracket expressions ((...), [...], {...}).

    2. ECMAscript style. Expressions are required to be terminated by ;, but a ; will be inserted automatically at a line-end if the next token cannot extend the expression.

    3. Maximal-munch. An expression is the longest possible stream of tokens which can be parsed as an expression, but ; can be used to separate expressions if necessary.

      I don't actually know of a common language which implements this option. But it (or some variation) would make it possible to write, for example,

      a = 3    b = 7    c = a + b
      

      which seems like it should be unambiguous.

    Implementation strategies.

    1. Parentheses hide newlines.

    This is fairly straightforward to implement until the language gets complicated. For simple expressions, you just need to keep a global parenthesis depth count, which is incremented with (any kind of) open parenthesis and decremented with closes; then the \n rule returns a ; if the parenthesis depth is 0, and otherwise swallows the newline.

    I personally find this style of line-continuation irritating, although I get used to it pretty fast every time I go back to programming in Python. Still, it always niggles a bit that you cannot write:

    x = some * long * sub-expression       +
        another * long * sub-expression    +
        magical-constant
    

    which would work just fine in ECMAscript.

    2. Automatic semicolons

    As it turns out, you can handle this in ECMAscript in the lexer, by looking up the first token on a line and the previous token in a dictionary of legal token-pairs. If the token pair is not in the dictionary, then a semicolon token is returned instead of the first token on the line; that token is remembered so that the next call to the lexer will return it. Constructing the dictionary of legal token-pairs is non-trivial, and verifying that it is correct is even harder, but it only needs to be done every time you change the grammar :)

    ECMAscript actually has a number of exceptions to the rules, so that some expressions which would otherwise be legal are made illegal when separated by a newline. These exceptions can be implemented (as it happens) by removing some token-pairs from the dictionary of legal pairs. For example, the ++ post-increment operator cannot be separated from its operand by a newline, and consequently a semi-colon is inserted in the following:

    a
    ++b
    

    which would otherwise be a syntax error. (The first two tokens would be parsed as a++, and then the b cannot be parsed.)

    Another interesting case is:

    a + b
    (c + d).format("x")
    

    This could be parsed as a function call:

    a + b(c + d).format("x")
    

    so it would be desirable to do an automatic semi-colon insertion between the b and the (.

    3. Maximal munch

    This is really just a variant on automatic semicolon insertion, where the semicolon may be inserted even when there is no newline between the two tokens. Because of the exceptions noted above, you would probably want to have two different sets of legal token-pairs, or mark certain token-pairs as only valid if not separated by a newline.