I just started playing around with ANTLR and I'm trying to understand an error message I get when parsing erroneous input. This is my (simple) grammar:
grammar Playground;
stmtList: (expr EOS)+;
expr:
IDENTIFIER ('!' | '^') IDENTIFIER
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| INT
| IDENTIFIER;
MAKE: 'make';
INT: '0' | [1-9] [0-9]*;
IDENTIFIER: [a-zA-Z0-9]+;
EQUAL: '='; // Dummy token that can be recognised
EOS: '\r'? '\n';
WS: [ \t\n\r]+ -> skip;
This is the text I'm attempting to parse:
blah=blah
Again, I know this text does not match the grammar defined. The error I'm getting is as follows:
line 1:4 mismatched input '=' expecting {'*', '/', '+', '-', EOS}
My question is - how come the expected set of tokens ANTLR recommends does not include tokens like '!' and '^' which are also defined in the first alternative of the expr rule? I feel like I'm missing some fundamental knowledge here. Any help is appreciated!
My expectation was to see an error message that looked like this:
line 1:4 mismatched input '=' expecting {'!', '^', '*', '/', '+', '-', EOS}
with the '!' and '^' tokens included in the expected set of tokens.
I'm reading The Definitive ANTLR 4 Reference at the moment, and I've also tried generating the tokens using ANTLR's TestRig.
Running grun Playground stmtList -tokens on blah=blah gives me the following output:
[@0,0:3='blah',<IDENTIFIER>,1:0]
[@1,4:4='=',<'='>,1:4]
[@2,5:8='blah',<IDENTIFIER>,1:5]
[@3,9:9='\n',<EOS>,1:9]
[@4,10:9='<EOF>',<EOF>,2:0]
ANTLR Version: 4.11.1
This is because you have two alternatives that start with IDENTIFIER in your expr rule, so both can be matched by your first identifier. And this is what actually happens. The first blah is matched as IDENTIFIER and the parser tries the first alt in expr. This fails because the next token is EQUAL, so it tries the next alt, which starts with the left-recursive expr rule, and makes the same attempt again. This time expr matches IDENTIFIER because of your last alt, so the next step is to match the operators, which all fail. In the end you get the expected tokens for the two left-recursive alts, because their first part matches. That is exactly the set shown in your error message.
If you remove the last alt in your expr rule, the outcome is as you expect, because now there is no match of expr with a single IDENTIFIER, so the first alt is what is taken to report the error.
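For reference, here is a sketch of the expr rule with the last alternative removed, as described above. It assumes the rest of the grammar from the question stays unchanged, and the layout is only for illustration:

```antlr
// Sketch for diagnosis only: the expr rule from the question with its
// last alternative (| IDENTIFIER) removed. All other rules are assumed unchanged.
expr:
    IDENTIFIER ('!' | '^') IDENTIFIER
    | expr ('*' | '/') expr
    | expr ('+' | '-') expr
    | INT
    ;
```

With this version, the expected set reported for blah=blah should come from the first alternative, so it should contain '!' and '^'. Keep in mind that this is only useful for seeing where the error is reported: a bare identifier is no longer a valid expr once that alternative is gone.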