Search code examples
c++parsingantlrgrammarlambda-calculus

Assignment grammar conflicting with λ-calculus application grammar


I am implementing an extended λ-calculus interpreter using ANTLR4 and it's C++ target. Here is the language grammar:

grammar lambda;

program: expression|;

expression:
    (Int | Bool)                                # literal
    | Identifier                                # variable
    | expression expression                     # application
    | Lambda Identifier '.' expression          # abstraction
    | Identifier '=' expression                 # assign
    | condition                                 # conditional
    | Operator expression expression            # binaryExpression
    | 'print' expression                        # printInstruction
    | '(' expression ')'                        # brackets;

body: expression;
condition: 'if' expression 'then' body 'else' body
    | '(' expression '->' body '|' body;

Lambda: '\\' | 'λ';
Bool : 'tru' | 'fls' | 'true' | 'false';
Int: [0-9]+;
Identifier: ('a' ..'z') ('a' ..'z' | '0' ..'9')*;
Operator:
    '+'
    | '-'
    | '*'
    | '/'
    | '<'
    | '>'
    | '<='
    | '>='
    | '==';

WS: [ \n\t\r]+ -> skip;

I am constructing an AST using the visitor model, which will be evaluated separately. I have encountered an issue with the way ANTLR is parsing the input, and I'm not really sure even what to call it.

Issue 1

// incorrect_association.lambda

y = 1
x = 1

Assignment ( y = ( Application ( Literal ( 1 ) ) ( Assignment ( x = ( Literal ( 1 ) ) ) ) ) )

The AST should be

Assignment ( y = ( Literal ( 1 ) )
Assignment ( x = ( Literal ( 1 ) )

or

Grouping (
    Assignment ( y = ( Literal ( 1 ) ),
    Assignment ( x = ( Literal ( 1 ) )
)

Issue 2

I suppose this may tie into the first issue: expressions across multiple lines are being read as Application expressions.

// incorrect_application.lambda

x = 1
print x

Assignment ( x = ( Application ( Literal ( 1 ) ) ( PrintInstruction ( Identifier ( "x" ) ) ) ) )

The AST should be

Assignment ( x = ( Literal ( 1 ) )
PrintInstruction ( Identifier ( "x" ) )

or

Grouping (
    Assignment ( x = ( Literal ( 1 ) ),
    PrintInstruction ( Identifier ( "x" ) )
)

I am trying to have imperative-like constant assignments, with a functional-like execution. Eventually, the program should just be whatever main = ... (like Haskell). Is it possible to prevent the Application rule from matching two expressions that are on different lines, but continue to allow any other whitespace and parens?

Possible Solution

I was contemplating writing a preprocessor that would just throw semicolons on each line ending. I may need to do this anyway because I plan on adding

imports: 'import' Identifier | '(' imports ')';

as a grammar rule, and haven't found a nice solution for handling imports with ANTLR. If I were to go this route, how would I include ; line endings in my grammar?

PS: I am very new to ANTLR, so any guidance would be super helpful.


Solution

  • If you want newlines to be significant, then let them pass through the lexical scanner.

    WS: [ \t\r]+ -> skip;
    NL: [\n];
    

    Then you could define a program as a sequence of expressions terminated with newline characters:

    program: ( expression NL )*;
    

    If you want semicolons to work as well, just change the definition of NL:

    NL: [\n;];
    

    You'll also want to change body to accept multiple expressions, although it's not clear to me what kind of punctuation you will want to use. It's possible that

    body: expression (NL expression)*;
    

    will work for you, but it might produce unexpected results.

    Your application syntax is highly ambiguous. I have no idea what Antlr will make of it, but I can't interpret it. If you have

    + a b c
    

    That must be one of:

    (+ a b) (c)
    (+ a (b c))
    (+ (a b) c)
    

    But I don't see any indication of which of the three should be prefered. I would think that you would need to come up with a grammar which has more precise precedence.

    (There is a reason Lisp and Scheme use parentheses :-) )