parsing fortran antlr4 lexer fortran2018

Trouble with ANTLR4 Fortran 2018 Grammar - Unexpected Errors and Mismatched Input


I've been working on creating an ANTLR4 grammar for Fortran 2018 based on the BNF rules provided in the J3 Fortran 2018 document. I've directly converted each rule mentioned in the document into ANTLR4 rules. However, I'm encountering some unexpected errors and mismatched input issues when running the grammar with a test program.

The goal is to parse Fortran code into an AST. After the direct conversion, I removed every mutually left-recursive rule that the conversion had introduced.

Grammar:

grammar Fortran2018;

//LEXER

//Comment
LINE_COMMENT : '!' .*? '\r'? '\n' -> skip ;
BLOCK_COMMENT: '/*' .*? '*/' -> skip;

//WhiteSpace
WS: [ \t\r\n]+ -> skip;


// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';

// R0002 Letter ->
//         A | B | C | D | E | F | G | H | I | J | K | L | M |
//         N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z' | 'a'..'z';

// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;


//R0003 RepChar
REPCHAR: NON_CONTROL_CHAR | ESCAPE_SEQUENCE;
NON_CONTROL_CHAR: ~[\u0000-\u001F];
ESCAPE_SEQUENCE: '\\' ('\\' | 'n' | 't' | '"');


LPAREN: '(';
...

Grammar file: https://github.com/AkhilAkkapelli/F2018Antlr4Grammer/blob/main/Fortran2018.g4

ANTLR4 Output:

akhil@KHUSHI:~/***$ antlr4 Fortran2018.g4

warning(154): Fortran2018.g4:1183:0: rule derivedTypeDef contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2077:0: rule blockConstruct contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2712:0: rule mainProgram contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2724:0: rule module contains an optional block with at least one alternative that can match an empty string
warning(154): Fortran2018.g4:2773:0: rule submodule contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2786:0: rule blockData contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2816:0: rule interfaceBody contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2816:0: rule interfaceBody contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2936:0: rule functionSubprogram contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2962:0: rule subroutineSubprogram contains an optional block with at least one alternative that can match an empty string

warning(154): Fortran2018.g4:2982:0: rule separateModuleSubprogram contains an optional block with at least one alternative that can match an empty string

Test Program: test.f90

PROGRAM TEST

IMPLICIT NONE

INTEGER :: a

a = 5

END program TEST

Output: grun Fortran2018 program -gui test.f90


line 5:11 mismatched input 'a' expecting NAME

Questions

  1. What could be the possible reasons for the "mismatched input" errors that I'm encountering?

  2. I've noticed that I received warnings about rules containing optional blocks that can match an empty string. Could these warnings be related to the errors I'm seeing?

  3. Are there any common pitfalls or gotchas when converting BNF rules to ANTLR4 syntax that I might have missed?

  4. Could anyone offer insights into the specific error messages I've provided? What might be causing these errors, and how could I go about resolving them?

What I'm Looking For: I appreciate any help and guidance in identifying and resolving the issues in my ANTLR4 grammar. Please let me know if you need additional information or context to better understand the problem.


Solution

  • Your rule:

    // R1315 position-edit-desc -> T n | TL n | TR n | n X
    positionEditDesc: 'T' n | 'TL' n | 'TR' n | n 'X';
    

    implicitly defines a 'T' token. Your LETTER lexer rule matches only a single letter, so the input "T" matches both that implicit token and the LETTER rule, with the same length. In a combined grammar, ANTLR places the implicit tokens for literals ahead of the explicit lexer rules, and when two rules match the same text the one defined first wins. So 'T' is the token you get.

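    Your token stream (note that TEST is split into 'T', 'ES', 'T', and that the a at 5:11 is a LETTER token where the parser expects NAME):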

    [@0,0:6='PROGRAM',<'PROGRAM'>,1:0]
    [@1,8:8='T',<'T'>,1:8]
    [@2,9:10='ES',<'ES'>,1:9]
    [@3,11:11='T',<'T'>,1:11]
    [@4,14:21='IMPLICIT',<'IMPLICIT'>,3:0]
    [@5,23:26='NONE',<'NONE'>,3:9]
    [@6,29:35='INTEGER',<'INTEGER'>,5:0]
    [@7,37:38='::',<'::'>,5:8]
    [@8,40:40='a',<LETTER>,5:11]
    [@9,43:45='END',<'END'>,7:0]
    [@10,47:53='PROGRAM',<'PROGRAM'>,7:4]
    [@11,55:55='T',<'T'>,7:12]
    [@12,56:57='ES',<'ES'>,7:13]
    [@13,58:58='T',<'T'>,7:15]
    [@14,59:58='<EOF>',<EOF>,7:16]
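
    The collision can be reproduced in isolation. The following is a minimal sketch, not taken from your grammar: the grammar name TieBreak and the rules unit and letter are hypothetical, and only the 'T' literal and the single-letter LETTER rule mirror your situation.

    ```antlr
    grammar TieBreak;  // hypothetical name, not part of the original grammar

    unit         : (positionEdit | letter)* EOF ;
    positionEdit : 'T' DIGIT ;   // the literal 'T' becomes an implicit token
    letter       : LETTER ;

    LETTER : [A-Za-z] ;          // also matches "T", with the same length
    DIGIT  : [0-9] ;
    WS     : [ \t\r\n]+ -> skip ;
    ```

    With this grammar the input T is tokenized as 'T' and never as LETTER, so the letter rule cannot match it, while a still comes out as LETTER. Running grun with -tokens instead of -gui prints the token stream, which makes this kind of problem visible before you look at the parse tree.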
    

    In a grammar of this complexity you really want to separate the lexer and the parser into their own grammars, so that you have more control over how the lexer is generated.
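
    As a rough sketch of what such a split could look like for your test program only (the file and rule names below are made up for illustration, and the caseInsensitive option assumes ANTLR 4.10 or later), the lexer grammar owns every token and puts keywords before NAME:

    ```antlr
    // FortranLexerSketch.g4 (hypothetical file name)
    lexer grammar FortranLexerSketch;

    options { caseInsensitive = true; }  // assumption: ANTLR 4.10+

    // Keywords come first so they win the tie against NAME.
    PROGRAM  : 'PROGRAM' ;
    END      : 'END' ;
    IMPLICIT : 'IMPLICIT' ;
    NONE     : 'NONE' ;
    INTEGER  : 'INTEGER' ;

    DOUBLE_COLON : '::' ;
    ASSIGN       : '=' ;

    NAME        : [A-Z] [A-Z0-9_]* ;   // a whole identifier is one token
    INT_LITERAL : [0-9]+ ;

    LINE_COMMENT : '!' ~[\r\n]* -> skip ;
    WS           : [ \t\r\n]+ -> skip ;
    ```

    The parser grammar then refers to tokens by name only:

    ```antlr
    // FortranParserSketch.g4 (hypothetical file name)
    parser grammar FortranParserSketch;

    options { tokenVocab = FortranLexerSketch; }

    program      : PROGRAM NAME implicitStmt? declaration* assignment* END PROGRAM NAME? EOF ;
    implicitStmt : IMPLICIT NONE ;
    declaration  : INTEGER DOUBLE_COLON NAME ;
    assignment   : NAME ASSIGN INT_LITERAL ;
    ```

    This is deliberately simplified: it treats keywords as reserved and skips newlines, neither of which holds for real Fortran, so take it as an illustration of the lexer/parser split rather than a starting point for the full language.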

    If you are open to some advice: this grammar reads as a painfully literal conversion of the EBNF grammar, with no separation of concerns between lexer and parser. As a result, you have very few explicit token rules, and the ones you have are so simple that they will produce MANY more tokens than you really want. The impression is that explicit token rules were only created when you "had to". On top of that you have a LOT of implicit lexer rules, because every literal string in a parser rule creates one.
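
    One way to cut down the number of tokens is to turn character-level rules such as DIGIT and LETTER into fragment rules, which can only be used inside other lexer rules and never produce tokens themselves. A small sketch, assuming a NAME rule shaped like the standard's letter-then-alphanumerics definition (the grammar name and INT_LITERAL are hypothetical):

    ```antlr
    lexer grammar FragmentSketch;  // hypothetical name

    NAME        : LETTER ALPHANUMERIC* ;
    INT_LITERAL : DIGIT+ ;

    // Helpers only: the parser never sees these as tokens.
    fragment LETTER       : [A-Za-z] ;
    fragment DIGIT        : [0-9] ;
    fragment ALPHANUMERIC : LETTER | DIGIT | '_' ;

    WS : [ \t\r\n]+ -> skip ;
    ```

    Here TEST and a are each a single NAME token, and the parser no longer has to assemble identifiers from individual letters.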

    Before proceeding further, I'd suggest finding a good ANTLR tutorial and becoming well versed in the roles of the Lexer and Parser so that you can better identify which EBNF rules should be Lexer rules and which would be Parser rules.
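
    On the warning(154) messages from the question: they name parser rules, so they are most likely unrelated to the tokenization problem above. I have not inspected the flagged rules, but the usual cause is a `?` wrapped around something that can already match nothing. A hypothetical sketch of the pattern and its fix (none of these rules are copied from your grammar):

    ```antlr
    grammar OptionalWarning;  // hypothetical name

    // Should trigger warning(154): declaration* can match nothing,
    // so the surrounding ( ... )? is redundant.
    mainProgramBefore : programStmt? (declaration*)? endProgramStmt EOF ;

    // Same language, without the redundant optional.
    mainProgramAfter  : programStmt? declaration* endProgramStmt EOF ;

    programStmt    : 'PROGRAM' NAME ;
    declaration    : 'INTEGER' '::' NAME ;
    endProgramStmt : 'END' 'PROGRAM' NAME? ;

    NAME : [A-Za-z] [A-Za-z0-9_]* ;
    WS   : [ \t\r\n]+ -> skip ;
    ```

    The same applies when the optional element is a reference to a rule that can itself match the empty string: either make that rule non-empty or drop the `?` at the call site.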