Trying to understand why the expected tokens in an ANTLR 'mismatched input' error do not include some tokens


I just started playing around with ANTLR and am trying to understand an error message I get when parsing erroneous input. This is my (simple) grammar:

grammar Playground;

stmtList: (expr EOS)+;
expr:
IDENTIFIER ('!' | '^') IDENTIFIER
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| INT
| IDENTIFIER;

MAKE: 'make';
INT: '0' | [1-9] [0-9]*;
IDENTIFIER: [a-zA-Z0-9]+;
EQUAL: '='; // Dummy token that can be recognised
EOS: '\r'? '\n';
WS: [ \t\n\r]+ -> skip;

This is the text I'm attempting to parse:

blah=blah

Again, I know this text does not match the grammar defined. The error I'm getting is as follows:

line 1:4 mismatched input '=' expecting {'*', '/', '+', '-', EOS}

My question is: why does the expected set of tokens that ANTLR reports not include tokens like '!' and '^', which also appear in the first alternative of the expr rule? I feel like I'm missing some fundamental knowledge here. Any help is appreciated!

My expectation was to see an error message that looked like this:

line 1:4 mismatched input '=' expecting {'!', '^', '*', '/', '+', '-', EOS}

with the '!' and '^' tokens included in the expected set of tokens.

I'm reading The Definitive ANTLR 4 Reference at the moment, and I've also tried generating the tokens using ANTLR's TestRig.

Running grun Playground stmtList -tokens on blah=blah gives me the following output:

[@0,0:3='blah',<IDENTIFIER>,1:0]
[@1,4:4='=',<'='>,1:4]
[@2,5:8='blah',<IDENTIFIER>,1:5]
[@3,9:9='\n',<EOS>,1:9]
[@4,10:9='<EOF>',<EOF>,2:0]

ANTLR Version: 4.11.1


Solution

  • This happens because two alternatives in your expr rule start with IDENTIFIER, so both are candidates for the first blah. Before entering an alternative, ANTLR 4 predicts which one to take by looking ahead in the token stream. The first alt needs '!' or '^' after the IDENTIFIER, but the next token is '=' (EQUAL), so that alt is ruled out. The last alt, a lone IDENTIFIER, still fits, so the parser takes it and expr matches the first blah on its own. Because expr is left recursive, the next step is to try to continue that expr with one of the operators from the two left-recursive alts, or to end the statement with EOS. The '=' token matches none of these, so the error lists exactly the tokens that can follow a complete expr: '*', '/', '+', '-' and EOS.

    If you remove the last alt in your expr rule, '!' and '^' do appear in the expected set, because a single IDENTIFIER no longer matches expr on its own and the first alt is the one used to report the error.
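
To make this easier to picture, here is a rough sketch of the shape expr takes once the left recursion has been eliminated. It is a hand-written approximation, not the tool's actual output. The grammar name PlaygroundSketch is made up for illustration, and the precedence predicates that ANTLR adds to the operator loop are left out, so this version does not preserve operator precedence.

```antlr
// Conceptual sketch only, not the code ANTLR generates.
// PlaygroundSketch is a hypothetical name; precedence predicates are omitted.
grammar PlaygroundSketch;

stmtList: (expr EOS)+;

expr
    : ( IDENTIFIER ('!' | '^') IDENTIFIER   // primary alt: needs '!' or '^' next
      | INT                                  // primary alt: integer literal
      | IDENTIFIER                           // primary alt: lone identifier
      )
      ( ('*' | '/') expr                     // operator loop, entered only after
      | ('+' | '-') expr                     // a complete primary has been matched
      )*
    ;

// Lexer rules copied unchanged from the question.
MAKE: 'make';
INT: '0' | [1-9] [0-9]*;
IDENTIFIER: [a-zA-Z0-9]+;
EQUAL: '=';
EOS: '\r'? '\n';
WS: [ \t\n\r]+ -> skip;
```

Seen this way, '!' and '^' exist only inside the first primary alternative. Once prediction has settled on the lone IDENTIFIER, the parser sits at the start of the operator loop. The tokens valid there are the four operators plus whatever can follow expr in stmtList, which is EOS.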