c# antlr4 antlr4cs

ANTLR4 in C# catches only one token


g4 file:

grammar TestFlow;

options
{
    language=CSharp4;
    output=AST;
}

/*
 * Parser Rules
 */

compileUnit : LC | BC ;

/*
 * Lexer Rules
 */

BC  : '/*' .*? '*/' ;

LC  : '//' .*? [\r\n] ;

Code:

var input = "   /*aaa*/   ///   \n   ";

var stream = new AntlrInputStream(input);
ITokenSource lexer = new TestFlowLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
var parser = new TestFlowParser(tokens);
parser.BuildParseTree = true;
var tree = parser.compileUnit();
var n = tree.ChildCount;
var top = new List<string>();
for (int i = 0; i < n; i++) {
    top.Add(tree.GetChild(i).GetText());
}

After running the code above, top contains a single string: /*aaa*/. The single-line comment isn't caught.

What's wrong?


Solution

  • All parser/lexer generation errors and warnings are significant. Both statements in the options block (language and output) are invalid in the current version of ANTLR4.

    The runtime errors point to the root problem: unrecognizable input characters. Specifically, the grammar does not handle whitespace. Add a lexer rule to fix this:

    WS: [ \r\n\t] -> skip ;
    

    While not necessarily a problem, it is good form to require the parser to process all input. The lexer will generate an EOF token at the end of the source input. Fix the main rule to require the EOF:

    compileUnit : ( LC | BC ) EOF ;
    

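    As written, the rule matches exactly one comment token, so the parser stops after /*aaa*/ and never reaches the single-line comment. The correct way to allow for repetition is to use a * or + operator: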

    compileUnit : ( LC | BC )+ EOF ;
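
Putting the fixes above together, the grammar might look like the sketch below. It assumes the options block is simply dropped, as the answer calls both statements invalid, and it keeps the original lexer rules unchanged.

```antlr
grammar TestFlow;

/*
 * Parser Rules
 */

// One or more comments, followed by the end of input.
compileUnit : ( LC | BC )+ EOF ;

/*
 * Lexer Rules
 */

// Block comment: non-greedy match up to the first closing '*/'.
BC  : '/*' .*? '*/' ;

// Line comment: non-greedy match up to the first line break.
LC  : '//' .*? [\r\n] ;

// Whitespace between comments is discarded instead of raising lexer errors.
WS  : [ \r\n\t] -> skip ;
```

With this grammar, the sample input should tokenize as BC, LC, EOF. Because EOF is now part of compileUnit, the loop over tree.GetChild(i) in the question would most likely visit the EOF terminal as a third child, so it may need to be filtered out when collecting the comments.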