Can ANTLR-generated parsers fail silently? That is, can they fail to recognise input without reporting a diagnostic?
Using a very small grammar for demonstration and only ANTLR's defaults, I notice two contrasting things when sending input to the usual test rig for the grammar below:
the parsers recognise valid input (actions show that), which is fine;
however, the recognisers seem to “accept” certain invalid(?) inputs, meaning there is no diagnostic. V3 and v4 parsers behave similarly. The issue, if there is one, appears when characters ('1') are missing at the front of an input for stat, provided that an input of just a NEWLINE was sent before it.
This is the v4 grammar:
grammar Simp;
prog : stat+ ;
stat : '1' '+' '1' NEWLINE
| NEWLINE
;
NEWLINE : [\r]?[\n] ;
The v3 grammar is the same, mutatis mutandis.
Some runs using v4; class TestSimp4 is the usual test rig as in the book(s), see below:
% printf "1+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
% printf "+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
line 1:0 extraneous input '+' expecting {'1', NEWLINE}
line 1:2 mismatched input '\n' expecting '+'
% printf "\n+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
%
The results of the first two invocations are what I expected. I had expected the last invocation to fail visibly, though. Correct?
Looking at the generated SimpParser.java, the silent exit seems to follow from the generated code, as outlined below. But should it be that way? I think ANTLR simply stops before it reaches the invalid input here, but it shouldn't just stop.
Question: Is this silent failure to be expected? Have I overlooked something like a greediness setting for grammar tokens with a + suffix?
Referring to the loop that calls stat() (in the prog() procedure):
The v3 parser sets a counter variable to >= 1 on successfully matching the initial NEWLINE. The effect is that EarlyExitException is then not thrown on later inputs; the parser just breaks out of the loop.
The v4 parser similarly calls _input.LA(1) and then just terminates the loop whenever that call’s result cannot be at the start of stat. (So no recovery?)
The test rig:
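import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenSource;

class TestSimp4 {
    public static void main(String[] args) throws Exception {
        // Read all of standard input, then lex and parse it.
        final CharStream subject = CharStreams.fromStream(System.in);
        final TokenSource tknzr = new SimpLexer(subject);
        final CommonTokenStream ts = new CommonTokenStream(tknzr);
        final SimpParser parser = new SimpParser(ts);
        parser.prog();  // start rule; syntax errors go to stderr by default
    }
}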
So another paraphrase of my question would be: “How does one create ANTLR parsers such that they will always say YES or NO?”
Your 3rd test input, \n+1\n, does not produce an error because your prog rule only asks for the rule stat to be recognized one or more times. So prog successfully matches the first \n and then stops, leaving the remaining tokens unconsumed. If you want the entire input (token stream) to be consumed, "anchor" your prog rule with the EOF token:
prog : stat+ EOF;
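To illustrate why the anchor matters, here is a minimal hand-written sketch in plain Java. It is not ANTLR's generated code and models neither its lexer nor its error recovery; the class EofAnchorDemo and the methods matchStat and accepts are hypothetical names invented for this illustration, and NEWLINE is simplified to a bare \n. It only mimics a stat+ loop that quietly exits when the lookahead cannot start a stat, with an optional end-of-input check standing in for EOF.

```java
public class EofAnchorDemo {

    /**
     * Tries to match one stat starting at pos.
     * Returns the position after the match, or -1 if no stat starts here.
     * Assumption: NEWLINE is simplified to "\n" (the grammar also allows "\r\n").
     */
    static int matchStat(String in, int pos) {
        if (in.startsWith("1+1\n", pos)) return pos + 4; // '1' '+' '1' NEWLINE
        if (in.startsWith("\n", pos)) return pos + 1;    // NEWLINE
        return -1;
    }

    /**
     * Mimics prog : stat+ (requireEof == false) or prog : stat+ EOF (requireEof == true).
     */
    public static boolean accepts(String in, boolean requireEof) {
        int pos = 0;
        int count = 0;
        while (true) {
            int next = matchStat(in, pos);
            if (next < 0) break; // lookahead cannot start a stat: leave the loop quietly
            pos = next;
            count++;
        }
        if (count == 0) return false;             // stat+ needs at least one stat
        return !requireEof || pos == in.length(); // the EOF anchor: nothing may remain
    }

    public static void main(String[] args) {
        String input = "\n+1\n";
        System.out.println("stat+     accepts: " + accepts(input, false));
        System.out.println("stat+ EOF accepts: " + accepts(input, true));
    }
}
```

Without the end-of-input check, the loop is satisfied after the first \n and the trailing +1\n is never looked at. With the real ANTLR 4 runtime, a test rig that needs a plain yes/no answer could additionally check parser.getNumberOfSyntaxErrors() after calling parser.prog().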