Search code examples
javascriptc++bisonerror-recovery

Bison error recovery for automatic semicolon insertion


I'm trying to write a Bison C++ parser for parsing JavaScript files, but I can't figure out how to make the semicolon optional.

As to ECMAScript 2018 specification (https://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf, chapter 11.9), semicolon isn't actually optional, instead it is inserted automatically during the parsing. In the specification, it is stated that:

When, as the source text is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true:

  • The offending token is separated from the previous token by at least one LineTerminator[...]

According to this, I'm trying to solve this problem in this naive way:

  • Detect the error, using the error special token;
  • Tell the lexer that a syntax error occurred, during the action; if it has encountered a newline character before the current token, the lexer will return a new semicolon token at the next yylex call; at the subsequent call, it will return the token that previously was the offending one when the syntax error occurred.

A very simplified structure of my parser is like the following:

program:
   stmt_list END
;

stmt_list:
    %empty
 |  stmt_list stmt
 |  stmt_list error  { /* error detected; tell the lexer about the syntax error */ }
;

stmt:
    value SEMICOLON
|   [other types of statements...]
;

value:
    NUMBER
|   STRING
;

But doing this way, in case the file contains a valid JavaScript statement without a terminating semicolon, but a newline character, when an offending token is encountered, the parser reduces the rest of the statement into an error special token. As I tell the lexer about the syntax error, the parser has already reduced the error token into stmt_list one and the previous valid instruction is lost, making the semicolon insertion useless.

Obviously I don't want to let my parser discard the valid statement and go to the next one.

How can I make this possible? Is this the right approach or am I missing something?


Solution

  • I don't believe this approach is workable.

    Just as a note, you would have to detect the error before any reduction takes place. So for semicolon insertion at the end of a statement, you need to add the error production to stmt, not stmt_list. So you would end up with something like this:

    stmt_list
         :  %empty
         |  stmt_list stmt
    
    stmt: value ';'   { handle_value_stmt(); }
        | value error { handle_value_stmt(); }
        | [other types of statements...]
    

    That doesn't insert a semicolon; it just pretends that the semicolon was inserted. (If a semicolon couldn't be inserted, then another error will be triggered.)

    But since it doesn't involve the lexer, it will happen whether or not the missing semicolon was at the end of a line, which is too enthusiastic. So the ideal solution would be to somehow tell the lexer to generate a semicolon token as the next token. But at the point where the error is detected, the lexer has already produced the lookahead token, and the parser knows what the lookahead token is. And it will use its recorded lookahead token to continue the parse.

    There's also the question of how it is possible to communicate with the lexer at this point, since Mid-Rule Actions don't really play well with the error recovery algorithm. In theory, you could use the fact that yyerror will be called to report the error, but that means that yyerror needs to be able to deduce that this is not a "real" error, which means it will have to go poking into yyparse's guts. (I'm sure this is possible but I don't know how to do it off the top of my head, and it doesn't seem to me to be recommendable.)

    Now, in theory it is possible to tell the parser to discard the lookahead token, and to tell the lexer to generate a semicolon followed by a repeat of the token it just sent. So it is just barely possible that by piling hack onto hack, you could make this work, if you're stubborn enough. But you'd end up with something very difficult to maintain, verify and test. (And making sure that it works in all corner cases will also be a challenge.)

    And that's without looking at the other cases where semicolons could be inserted.

    My approach to ASI was to simply analyse the grammar by figuring out which pairs of consecutive tokens are possible. (That's easy to do; you just need to construct FIRST and LAST sets, and then read through all the productions looking at consecutive symbols.) Then if the input consists of token A followed by one or more newlines followed by token B, and it is not possible for A to be followed by B in the grammar, then that's a candidate for semicolon insertion. The semicolon insertion might fail, but that will generate a syntax error, so you can't get a false positive. (You might have to fix the syntax error message, but at that point you at least know that you've inserted a semicolon.)

    Proving that that algorithm works is trickier, because it could theoretically be the case that A could be followed by B in some context but that it is not possible in the current context, while A ; B would be possible in the current context. In that case, you might miss a possible semicolon insertion. I haven't looked in detail at recent JS versions, but long ago when I wrote a JS lexer, I managed to prove to my own satisfaction that there are no such cases.


    Note: since the question was raised in a comment, I'll add a little hand-waving, although I really don't recommend following this approach.

    Without diving into bison's guts, it's really not possible to "unshift" a token, including the error token (which is a real token, more or less). By the time the error token has been shifted, the parse is effectively committed to an error production. So if you want to annul the error, you have to accept that fact and work around it.

    After an error token has been shifted, the parser will then skip tokens until a shiftable token is encountered. So if you've managed to insert an automatic semicolon into the token stream, you can use that token as a guard:

        stmt: value ';'       { handle_value_stmt(); }
            | value error ';' { handle_value_stmt(); }
    

    However, you might not have managed to insert an automatic semi-colon, in which case you really need to report the syntax error (and maybe attempt to resynchronise). The above rules would just silently drop tokens up to the next semicolon, which is certainly wrong. So a first approximation would be for your ASI inserter to always insert something, which can be used as a guard in the error productions:

        stmt: value ';'       { handle_value_stmt(); }
            | value error ';' { handle_value_stmt(); }
            | value error NO_ASI { handle_real_error(); }
    

    That's sufficient for "abort on error" processing, but if you want to do error recovery, you'll need to do some more hackery.

    As I said, I really don't recommend going down this route. The end result won't be pretty, even if it works (and you still might find that code which you thought worked fails on real user input, in a case you didn't consider.)