Search code examples
grammarbnfambiguitylalrdisambiguation

Solve ambiguity in my grammar with LALR parser


I'm using whittle to parse a grammar, but I'm running into the classical LALR ambiguity problem. My grammar looks like this (simplified):

<comment> ::= '{' <string> '}'           # string enclosed in braces
<tag> ::= '[' <name> <quoted-string> ']' # [tagname "tag value"]
<name> ::= /[A-Za-z_]+/                  # subset of all printable chars
<quoted-string> ::= '"' <string> '"'     # string enclosed in quotes
<string> ::= /[:print:]/                 # regex for all printable chars

The problem, of course, is <string>. It contains all printable characters and is therefore very greedy. Since it's an LALR parser, it tries to parse a <name> as a <string> and everything breaks. The grammar complicates things because it uses different string delimiters for different things, which is why I tried to make the <string> rule in the first place.

Is there a canonical way to normalize this grammar to make it LALR compliant, if it's even possible?


Solution

  • This is not "the classical LALR ambiguity problem", whatever that might be. It is simply an error in the lexical specification of the language.

    I took a quick glance at the Whittle readme, but it didn't bear any resemblance to the grammar in the OP. So I'm assuming that the text in the OP is conceptual rather than literal, and the fact that it includes the obviously incorrect

    <string> ::= /[:print:]/                 # regex for all printable chars
    

    is just a typo.

    Better would have been /[:print:]*/, assuming that Ruby lets you get away with [:print:] rather than the Posix-standard [[:print:]].

    But that wouldn't be correct either because lexing (usually) matches the longest possible string, and consequently that will gobble up the closing quote and any following text.

    So the correct solution for quoted-string is to write it out correctly:

    <quoted-string> ::= /"[^"]*"/
    

    or even

    <quoted-string> :: /"([^\\"]|\\.)*"/
    # any number of characters other than quote or escape, or escaped pairs
    

    You might have other ideas about how to escape internal double quotes; those are just examples. In both cases, you need to postprocess the token in order to (at least) strip the double-quotes and possible interpret escape sequences. That's just the way it goes.

    Your comment sequences present a more difficult issue, assuming that your intention was that a comment might include nested braces (eg. {This comment {with this} ends here}) because the nested brace syntax is not regular and thus cannot be matched with a regular expression. Of course, very few "regular expression" libraries are really regular these days, and I don't know if Ruby contains some sort of brace-counting extension, like for example Lua's pattern syntax. The nested brace syntax is certainly context-free but to actually parse it you need to lexically analyze the contents of the outer {...} in a different way than the rest of the program.

    It is this latter observation, and not any weakness in the LALR algorithm, that is causing you pain, and I'd say that this is a weakness with the (mostly undocumented afaics) lexical analysis section of whittle. In a flex-generated lexer, for example, it would be normal to use start conditions to separate the lexical environments (program / quoted string / braced comment), and the parser would then have no ambiguity.

    Hope that helps.