Search code examples
parsingbisonyaccebnf

Converting Extended BNF to Bison grammar, but having shift/reduce errors


Background

I'm working on a compiler for a latex-like language. I've already written the lex file and that's working the way it ought to, so far. However, I've been running into problems now that I'm working on the grammar in the .y file.

Problem

I've reproduced the part of the grammar that I believe is responsible for stymieing me:

%start document
%%
document: BEGINDOCUMENT documentbody ENDDOCUMENT;
documentbody: contentseq | ws MAKETITLE contentseq | MAKETITLE contentseq;
contentseq: | contentseq content;
content: STRING | ws;
ws: WHITESPACE;

Whitespace, in this context, is basically any mix of spaces, tabs, and newlines.

As I understand it from looking at the y.output file, the shift/reduce error is raised because of the rule

documentbody: ... | ws MAKETITLE contentseq | ...

Given the WHITESPACE token, Bison doesn't know whether this is part of the terminal "content" or if it will instead be followed by a MAKETITLE token. Both are perfectly valid input, and I'm unsure of how to fix this issue.

For clarity, a paraphrase of the original EBNF specification:

document: BEGINDOCUMENT [ws] [MAKETITLE] contentseq ENDDOCUMENT

In other words, both the ws terminal and MAKETITLE are optional.

Example input

BEGINDOCUMENT WHITESPACE MAKETITLE STRING ENDDOCUMENT
BEGINDOCUMENT WHITESPACE STRING ENDDOCUMENT
BEGINDOCUMENT MAKETITLE STRING ENDDOCUMENT
BEGINDOCUMENT STRING ENDDOCUMENT

All of the above should be accepted by the grammar.

What I've tried

I know many conflicts can be smoothed over by using precedence, but nothing I've tried in that vein has worked. I've tried assigning the MAKETITLE and WHITESPACE tokens every kind of precedence, but that hasn't solved the problem.

I've seen suggestions on other shift/reduce-related problems to rewrite the grammar to be less ambigous, but I'm not sure how to go about doing that - at least without changing what input the grammar accepts and does not accept.

One solution I have thought of, but not attempted, is messing with the lex file, but that seems a rather icky solution and I would rather find some way of doing it in yacc.


Solution

  • The conflict basically results from the nullability of contentseq. That forces the parser to recognise an empty contentseq before it recognises a longer contentseq. And that produces a conflict when the input starts BEGINDOCUMENT WHITESPACE, because at the point before the WHITESPACE, it is not knowable whether that empty contentseq should be reduced.

    You can easily resolve that by making contentseq non-nullable (contentseq: content | contentseq content), at the cost of having to explicitly handle omitted sequences:

    documentbody: %empty | contentseq | maketitle optionalcs
    contentseq: content | contentseq content
    optionalcs: %empty | contentseq
    maketitle: WHITESPACE MAKETITLE | MAKETITLE
    

    This is a general problem with converting EBNF's optional syntax [ x ], particularly when x is repeated. You cannot always rely on being able to define optional-x; you often have to create two right-hand sides, one with x and the other without.


    I don't see the point of ws: WHITESPACE; you could just use a WHITESPACE token instead of the ws non-terminal. If your grammar is more complicated than what you show, that non-terminal could be triggering a conflict but I don't see any ambiguity in what you have pasted. Nonetheless, in the sample solution above, I removed the redundant non-terminal.