Search code examples
parsingyacclalr

Meaning of YACC expression using yysindex and yyrindex


In a yacc-based project I've encountered a complex expression which I don't understand what it does. The expression is repeated multiple times, so it looks like copy-and-paste. In fact the same exact expression occurs inside the YACC skeleton (byacc-1.9), so I'm assuming it has some particular meaning.

Here's the expression:

if (((yyn = yysindex[lastyystate]) && (yyn += tok) >= 0 &&
             yyn <= YYTABLESIZE && yycheck[yyn] == tok) ||
            ((yyn = yyrindex[lastyystate]) && (yyn += tok) >= 0 &&
             yyn <= YYTABLESIZE && yycheck[yyn] == tok)) {

If you partition this you get

((yyn = yysindex[lastyystate]) && (yyn += tok) >= 0 && yyn <= YYTABLESIZE && yycheck[yyn] == tok)

OR

((yyn = yyrindex[lastyystate]) && (yyn += tok) >= 0 && yyn <= YYTABLESIZE && yycheck[yyn] == tok))

I'm fairly familiar with parser generators and know that yacc is LALR(1). I'm guessing

  • yysindex is the "shift table"
  • yyrindex is the "reduce table"
  • yyn <= YYTABLESIZE is just range checking

Given the similarity of the two parts, and assuming my guesses are right, it would seem like they both look for something in the packed/coded parsing tables. I haven't digged into the details on how yacc stores the parsing tables, if someone knows more about this that would probably help.

In this context tok is (obiviously) the token number. But why the += tok and what is the yycheck table?

This code is a part of source code completion, say using TAB, if that helps to explain things.

Extra points if you can explain in a single sentence, as in give a function name, what the intention of this complex expression is.


Solution

  • "Next+check" transition table compression, also commonly referred to as the comb algorithm, is described in the Dragon book with respect to lexers in Section 3.9, with a note in chapter 4 about its use with parser tables.

    The algorithm stores a sparse matrix by overlaying rows so that valid entries are overlapped only with missing entries. To look up a value in the compressed table you need a vector of the starting index for each row. Then you add the column number you're looking for to that starting index. Finally, you need to know whether the entry corresponds to the row you're looking in; the check vector contains the number of the row for which each table entry is valid.

    It's called comb compression from the idea that each row is like a comb with many broken tines.

    In the parser framework, that code (or something similar to it) will be used to ascertain the parser action corresponding to a token. Possible answers are:

    1. Shift the token and go to state S. (Push the token onto the stack and read a new lookahead).

    2. Reduce the stack using production number P. (Pop the right-hand side off the stack, execute the reduction action, go to the state found by consulting the GOTO table from the state revealed by the pop and the number of the reduced non-terminal, push the reduced non-terminal onto the stack, and then continue with the same lookahead token.)

    3. Signal an error.

    4. Accept the input. (Only if the lookahead token is the end-of-input marker. This possibility is often special cased.)

    I guess that you are right about their being separate shift and reduce action tables. It's possible that the rows overlap better if you compress the actions separately, although the compression would have to be a lot better to compensate for the extra check array.

    Given that the code is used for constructing a completion list, I suppose that it is being used in a simulation of the parse for each possible next token, in order to decide what the valid next token candidates are. The statement returns true if the token can be acted upon (if not, the candidate can simply be removed from the completion list) and sets yyn to the next action. It will be necessary to distinguish between Shift and Reduce actions. In some parser frameworks, the sign of the action is used for this purpose but there are other possibilities.

    So I'd call the function find_parse_action (or find_parse_action_if_any, if you like wordier names).

    Note that in an LALR parser, the simulation will need to iteratively apply reduce actions because the acceptability of a token isn't known until a shift action is actually encountered. Simulating reductions involves popping the stack which would be destructive if applied to the real parser stack. So part of the code probably creates and manages a simulated stack. (Although it's also possible that byacc uses a spaghetti stack. I've never actually looked at its skeleton.)

    Bison uses similar code to implement "Lookahead Correction" (LAC), which is used to produce more informative error messages. ("Expecting X or Y or Z", which is another completion list activity.) So you can see another possible implementation technique in the Bison skeleton. IIRC, it makes a copy of the stack in order to simulate pops.