Require newline or EOF after statement match


I'm looking for a simple way to get ANTLR4 to generate a parser that handles the following (anything after the ; should be ignored):

int #i ;    defines an int
int #j ;    see how I have to go to another line for another statement?

My grammar is as follows:

compilationUnit:
    (statement END?)*
    statement END?
    EOF
;

statement:
    intdef |
    WS
;

// 10 - 1F block.

intdef:
    'intdef' Identifier
;

// Lexer.

Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];

// Whitespace, fragments and terminals.

WS: [ \t\r\n\u000C]+ -> skip;
//COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
END: (';' ~[\r\n]*) | '\n';

In essence, every statement must be followed by a newline before the next statement begins. I don't care if there are three newlines, or if one of the blank lines contains a bunch of tabs, as long as there is at least one newline.

The issue is that the ANTLR4 parse tree shows errors for inputs such as:

.

(Pretend the dot isn't there; it's literally no input.)

int #i int #j

Whoops, we got two on the same line!

Any ideas on how I can achieve this? I appreciate the help.


Solution

  • I've simplified your grammar a bit but made it require an end-of-line sequence after each statement to parse correctly.

    grammar Testnl;
    
    program: (statement )* EOF ;
    
    statement: 'int' Identifier EOL;
    
    Identifier: '#' Letter LetterOrDigit*;
    fragment Letter: [a-zA-Z_];
    fragment LetterOrDigit: [a-zA-Z0-9$_];
    
    EOL: ';' .*? '\r\n'
       | ';' .*? '\n'
       ;
    
    WS: [ \t\r\n\u000C]+ -> skip;
    

    It parses this input:

    int #i ;
    int #j;

    into the following token stream:

    [@0,0:2='int',<'int'>,1:0]
    [@1,4:5='#i',<Identifier>,1:4]
    [@2,7:9=';\r\n',<EOL>,1:7]
    [@3,10:12='int',<'int'>,2:0]
    [@4,14:15='#j',<Identifier>,2:4]
    [@5,16:18=';\r\n',<EOL>,2:6]
    [@6,19:18='<EOF>',<EOF>,3:0]
    

    It also ignores anything after the semicolon, treating it as part of the EOL token:

    [@0,0:2='int',<'int'>,1:0]
    [@1,4:5='#i',<Identifier>,1:4]
    [@2,7:20='; ignore this\n',<EOL>,1:7]
    [@3,21:23='int',<'int'>,2:0]
    [@4,25:26='#j',<Identifier>,2:4]
    [@5,27:28=';\n',<EOL>,2:6]
    [@6,29:28='<EOF>',<EOF>,3:0]
    

    Either a line feed or a carriage return followed by a line feed works as the line ending. Is that what you're looking for?

    EDIT

    Per the OP's comment, I made a small change so that a single EOL token absorbs consecutive line endings, and moved the EOL token out of the statement rule and into the program rule to reduce repetition:

    grammar Testnl;

    program: ( statement EOL )* EOF ;
    
    statement: 'int' Identifier;
    
    Identifier: '#' Letter LetterOrDigit*;
    fragment Letter: [a-zA-Z_];
    fragment LetterOrDigit: [a-zA-Z0-9$_];
    
    EOL: ';' .*? ('\r\n')+
       | ';' .*? ('\n')+
       ;
    
    WS: [ \t\r\n\u000C]+ -> skip;
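
    Both grammars above require a semicolon on every line and a line ending after the last statement. If the semicolon should only start an optional trailing comment, and the final statement may end at EOF, one possible variant is to make the newline its own token and stop skipping it as whitespace. This is an untested sketch, not part of the original answer; the grammar name TestnlAlt and the rule names COMMENT and NL are made up for illustration.

    ```antlr
    grammar TestnlAlt;

    // Statements are separated by one or more newlines.
    // Leading and trailing blank lines are allowed, and so is empty input.
    program: NL* (statement (NL+ statement)* NL*)? EOF ;

    statement: 'int' Identifier ;

    Identifier: '#' Letter LetterOrDigit* ;
    fragment Letter: [a-zA-Z_] ;
    fragment LetterOrDigit: [a-zA-Z0-9$_] ;

    // Everything from ';' up to (but not including) the line ending is dropped.
    COMMENT: ';' ~[\r\n]* -> skip ;

    // The line ending is a real token, so the parser can require it.
    NL: '\r'? '\n' ;

    // Newlines are deliberately left out of the skipped whitespace.
    WS: [ \t\u000C]+ -> skip ;
    ```

    With this layout, int #i int #j on a single line should be reported as a syntax error, because the parser expects NL or EOF after the first statement. Blank lines that contain only tabs should still be accepted, since the tabs are skipped and the newlines around them match NL+.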