Search code examples
cbisonyacc

How to recover from error in Bison in this LALR grammar?


I am trying to find out how to correctly recover from error in Bison. The thing is, when the input is correct, everything works until the input is incorrect. In that case it determines all next inputs as incorrect even though they are correct. Does anybody know how to solve this?

%{
    int yyerror(char *s);
    int yylex(void);
   //#define YYDEBUG 1
   #include <stdio.h>
   int linenum = 0;
%}


%token a_TOK b_TOK c_TOK d_TOK e_TOK f_TOK g_TOK h_TOK newline_TOK any_TOK

%%
X: S
    | error newline_TOK { yyerrok; yyclearin; };
S: A B LF X;
A: a_TOK h_TOK | a_TOK b_TOK A c_TOK | a_TOK B c_TOK;
B: C D E;
LF: newline_TOK {
        linenum += 1;
        printf("Correct derivation - line %d!\n", linenum);
    };
C: d_TOK | e_TOK;
D: f_TOK D | f_TOK;
E: B | g_TOK;
%%

int yyerror(char *s)
{
    linenum += 1; 
    printf("Syntax error - line %d\n", linenum);
    return 0;
}

int main(void)
{
#if YYDEBUG
  yydebug = 1;
#endif
    if(!yyparse())
        printf("End of input reached\n");
    return 0;
}


Solution

  • There are three issues with your current grammar and error recovery:

    • your top level X rule with the error production is not recursive, so after an error recovery, the only valid continuation is the end of the input. Anything other than an EOF will cause another error (which will discard up to the next newline to recover, and then repeat the process.)

    • your recursive S rule is (indirectly) right recursive, so the entire input must be recognized and pushed on the stack before S can be reduced, consuming the lines right to left. So again, error recovery can only happen at the end of the input.

    • your recursive S rule has no base case, so it can never actually end without an error -- an input without any errors will get a syntax error on reaching an EOF.

    The fix is to make your top level rule left recursive (with a base case), and (while you at it, though it is strictly not necessary, just simpler) get rid of the X rule. So you have

    S : /* epsilon */
      | S A B LF
      | S error newline_TOK { yyerrok; yyclearin; }
      ;
    

    Now after recovering from an error, it can parse more A B LF lines.

    You might think the S in the error production is odd, and it is fact not necessary (could be removed) without any change to the behavior, unless you want to add an action to the S A B LF rule that consumes $1 (perhaps to link all the parsed lines together into some sort of list or vector data structure). In that case, the S in the error production is needed to pass the previously parsed lines through an error recovery to the lines parsed afterwards.