Can a parser fail silently?


Can ANTLR-generated parsers fail silently? That is, can they stop without issuing any diagnostic when they do not recognise the input?

Using a very small grammar for demonstration and only ANTLR's defaults, I notice two contrasting things when sending input to the usual test rig for the grammar below:

  1. the parsers recognize valid input (actions show that), which is fine;

  2. however, the recognisers seem to “accept” certain invalid(?) inputs, meaning that no diagnostic is issued. The v3 and v4 parsers behave similarly. The issue, if it is one, appears when leading characters ('1') are missing from an input for stat, provided that an input of just a NEWLINE has been sent before it.

This is the v4 grammar:

grammar Simp;

prog : stat+ ;
stat : '1' '+' '1' NEWLINE 
     | NEWLINE
     ;

NEWLINE : [\r]?[\n] ;

The v3 grammar is the same, mutatis mutandis.

Some runs using v4; class TestSimp4 is the usual test rig as in the book(s), see below:

% printf "1+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
% printf "+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4 
line 1:0 extraneous input '+' expecting {'1', NEWLINE}
line 1:2 mismatched input '\n' expecting '+'
% printf "\n+1\n" |java -classpath "antlr-4.11.1-complete.jar:." TestSimp4
%

The results of the first two invocations are what I expected. I expected the last invocation to fail visibly as well, though. Is that expectation correct?

Looking at the generated SimpParser.java, the silent exit follows from the generated code, as outlined below. But should it be that way? I think that ANTLR simply stops before it reaches the invalid input here, but it should not just stop.

Question: Is this silent failure to be expected? Have I overlooked something, such as a greediness setting for grammar elements with a + suffix?

Some code analysis.

Referring to the loop that calls stat() (in the prog() procedure):

The v3 parser sets a counter variable to >= 1 on successfully matching the initial NEWLINE. As a result, EarlyExitException is not thrown on later input; the parser just breaks out of the loop.

The v4 parser similarly calls _input.LA(1) and then just terminates the loop whenever that call’s result cannot be at the start of stat. (So no recovery?)
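The loop behaviour described above can be illustrated with a small hand-written sketch. This is not the code ANTLR generates; the class and method names are hypothetical, tokens are simplified to single characters ('1', '+' and '\n' for NEWLINE), and ANTLR's error recovery is not modelled. It only shows why a `stat+` loop that checks one token of lookahead can exit quietly once at least one stat has been matched.

```java
// Hand-written stand-in for the loop in prog(); this is NOT ANTLR's generated code.
// Tokens are simplified to single characters: '1', '+' and '\n' (NEWLINE).
class UnanchoredProgSketch {

    // True if c can begin a stat, i.e. it is '1' or NEWLINE.
    static boolean canStartStat(char c) {
        return c == '1' || c == '\n';
    }

    // Matches one stat at pos; returns the position after it, or -1 on a mismatch.
    static int stat(String in, int pos) {
        if (in.startsWith("1+1\n", pos)) return pos + 4;
        if (in.charAt(pos) == '\n') return pos + 1;
        return -1;
    }

    // Models "prog : stat+ ;". Returns how many characters were consumed,
    // or -1 where an error would be reported.
    static int prog(String in) {
        int pos = 0;
        int count = 0;
        // Like the lookahead check: leave the loop when the next token cannot start a stat.
        while (pos < in.length() && canStartStat(in.charAt(pos))) {
            pos = stat(in, pos);
            if (pos < 0) return -1;
            count++;
        }
        // Only zero iterations count as an error; leftover input is ignored.
        return count >= 1 ? pos : -1;
    }

    public static void main(String[] args) {
        System.out.println(prog("1+1\n"));   // 4: everything consumed
        System.out.println(prog("+1\n"));    // -1: no stat matched at all
        System.out.println(prog("\n+1\n"));  // 1: stops after the first NEWLINE, no error
    }
}
```

In this sketch the third input is "accepted" with three characters left unread, which mirrors the silent exit observed with the real parsers.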

The test rig:

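import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenSource;

class TestSimp4 {
  public static void main(String[] args) throws Exception {
    // Read all of standard input, tokenize it, and parse starting at rule prog.
    final CharStream subject   = CharStreams.fromStream(System.in);
    final TokenSource tknzr    = new SimpLexer(subject);
    final CommonTokenStream ts = new CommonTokenStream(tknzr);
    final SimpParser parser    = new SimpParser(ts);
    parser.prog();
  }
}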

So another paraphrase of my question would be: “How does one create ANTLR parsers such that they will always say YES or NO?”


Solution

  • Your third test input, \n+1\n, does not produce an error because prog tells the parser to recognize the rule stat one or more times. prog successfully matches the leading \n as one stat and then stops, leaving +1\n unconsumed. If you want the entire input (token stream) to be consumed, "anchor" your prog rule with the EOF token:

    prog : stat+ EOF;
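The effect of the EOF anchor can be sketched in the same simplified, hand-written style. Again, this is not ANTLR output: the class and method names are hypothetical, tokens are single characters, and no error recovery is modelled. The only point is that requiring end of input after the loop turns the result into a plain YES or NO.

```java
// Hand-written model of "prog : stat+ EOF ;"; NOT ANTLR's generated code.
// Tokens are simplified to single characters: '1', '+' and '\n' (NEWLINE).
class AnchoredProgSketch {

    // Matches one stat at pos; returns the position after it, or -1 on a mismatch.
    static int stat(String in, int pos) {
        if (in.startsWith("1+1\n", pos)) return pos + 4;
        if (in.charAt(pos) == '\n') return pos + 1;
        return -1;
    }

    // YES only if at least one stat matched AND nothing is left before end of input.
    static boolean accepts(String in) {
        int pos = 0;
        int count = 0;
        while (pos < in.length() && (in.charAt(pos) == '1' || in.charAt(pos) == '\n')) {
            pos = stat(in, pos);
            if (pos < 0) return false;
            count++;
        }
        // The EOF anchor: leftover input is now a failure instead of being ignored.
        return count >= 1 && pos == in.length();
    }

    public static void main(String[] args) {
        System.out.println(accepts("1+1\n"));    // true
        System.out.println(accepts("\n1+1\n"));  // true
        System.out.println(accepts("\n+1\n"));   // false: "+1\n" is left over
    }
}
```

With the real parser, a YES/NO answer could then be derived in the test rig after parser.prog() returns, for example by checking whether parser.getNumberOfSyntaxErrors() is zero. That method is part of the ANTLR 4 runtime, but using it here is a suggestion and not something the original rig does.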