Search code examples
rustlr1

LR(1) Shift/Reduce Disambiguation


Given input with repeating BLOCKs, where each block has repeating BEGIN EVENT and END EVENT entries (END EVENT always follows a BEGIN EVENT):

[TIMESTAMP] BLOCK
[TIMESTAMP] BEGIN EVENT
[TIMESTAMP] END EVENT
[TIMESTAMP] BEGIN EVENT
[TIMESTAMP] END EVENT
...
[TIMESTAMP] BLOCK

How do you disambiguate this grammar with LR(1)? I'm using LALRPOP, and the minimum example of this is:

Timestamp = "[TIMESTAMP]";
BlockHeader = Timestamp "BLOCK";
Begin = Timestamp "BEGIN" "EVENT";
End = Timestamp "END" "EVENT";

Block = BlockHeader (Begin End)+;
pub Blocks = Block*

Because LR(1) can only look one token ahead though, this grammar is ambigious, as LALRPOP helpfully tells you (partial error):

Local ambiguity detected

  The problem arises after having observed the following symbols in the input:
    BlockHeader (Begin End)+
  At that point, if the next token is a `"[TIMESTAMP]"`, then the parser can proceed in two different ways.

  First, the parser could execute the production at
  /home/<snip>.lalrpop:51:9: 51:32, which would consume
  the top 2 token(s) from the stack and produce a `Block`. This might then yield a parse tree like
    BlockHeader (Begin End)+ Block
    ├─Block────────────────┤     │
    ├─Block+───────────────┘     │
    └─Block+─────────────────────┘

  Alternatively, the parser could shift the `"[TIMESTAMP]"` token and later use it to construct a
  `Timestamp`. This might then yield a parse tree like
    (Begin End)+ "[TIMESTAMP]" "BEGIN" "EVENT" End
    │            ├─Timestamp─┘               │   │
    │            └─Begin─────────────────────┘   │
    └─(Begin End)+───────────────────────────────┘

I see that it is telling me, after parsing a BlockHeader, Begin and End it can't figure out if the next token is another Begin, or the start of another Block. I have not found a way to disambiguate this in LR(1), but I can only assume this is a lack of understanding on my part though, and not an inherit limitation of LR(1) grammars?


Solution

  • Unfortunately this kind of 'need more lookahead' problem is hard to solve without completely restructuring the grammar, which often loses desirable structure of the input, and sometime accepts degenerate inputs the original grammar would reject. You can usually reject those inputs and get that structure back by post-processing the parse tree, but that is more work. In your case, the grammar:

    Timestamp = "[TIMESTAMP]";
    BlockHeader = Timestamp "BLOCK";
    Begin = Timestamp "BEGIN" "EVENT";
    End = Timestamp "END" "EVENT";
    Event = Begin End;
    Item = BlockHeader | Event;
    pub Input = Item*
    

    should do the trick, but has the problem that it loses the block structure (instead giving you an unstructured sequence of block headers and events), and it accepts empty blocks. You can readily deal with both problems by post-processing the list of items.

    An alternative option when the required lookahead is small and bounded is to deal with it in your tokenizer. I'm not familiar with LALRPOP, but it should be possible to "combine" the [TIMESTAMP] tokens with the immediately following keyword tokens (so the timestamps would not be present in the grammar, instead just being an attribute on the keywords), in which case everything would work fine with single token lookahead.