Search code examples
rascal

Does a priority declaration disambiguate between alternative lexicals?


In my previous question, there was a priority > declaration in the example. It turned out not to matter because the solution there did not actually invoke priority but rather avoided it by making the alternatives disjoint. In this question, I'm asking whether priority can be used to select one lexical production over another. In the example below, the language of the production WordInitialDigit is intentionally a subset of that of WordAny. The production Word looks like it should disambiguate between the two properly, but the resulting parse tree has an ambiguity node at the top. Is a priority declaration able to decide between different lexical reductions, or does it require there to be a basis of common lexical elements? Or something else?

The example is contrived (there are no actions in the grammar), but the situations it arises from are not. For example, I'd like to use something like this for error recovery, where I can recognize a natural boundary for a unit of syntax and write a production for it. This generic production would be the last element in a priority chain; if it reduces, it means that there was no valid parse. More generally, I need to be able to select lexical elements based on syntactic context. I had hoped, since Rascal is scannerless, that this would be seamless. Perhaps it is, though I don't see it at the moment.

I'm on the unstable branch, version 0.10.0.201807050853.

EDIT: This question is not about > for defining an expression grammar. The documentation for priority declarations talks mostly about expressions, but the very first sentence provides what looks like a perfectly clear definition:

Priority declarations define a partial ordering between the productions within a single non-terminal.

So the example has two productions, an ordering declared between them, and yet the parser is still generating an ambiguity node in the clear presence of a disambiguation rule. So to put a finer point on my question, it looks like I don't know which of two situations pertains. Either (1) if this isn't supposed to work, then there's a defect in the language definition as documented, a deficiency in error reporting of the compiler, and a language design decision that's somewhere between counter-intuitive and user-hostile. Or (2) if this is supposed to work, there's a defect in the compiler and/or parser (presumably because the focus was initially on expressions) and at some point the example will pass its tests.

module ssce

import analysis::grammars::Ambiguity;
import ParseTree;
import IO;
import String;

lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
    WordInitialDigit
    > WordAny
    ;

test bool WordInitialDigit_0() = parseAccept( #Word, "4foo" );
test bool WordInitialDigit_1() = parseAccept( #WordInitialDigit, "4foo" );
test bool WordInitialDigit_2() = parseAccept( #WordAny, "4foo" );

bool verbose = false;

bool parseAccept( type[&T<:Tree] begin, str input )
{
    try
    {
        parse(begin, input, allowAmbiguity=false);
    }
    catch ParseError(loc _):
    {
        return false;
    }
    catch Ambiguity(loc l, str a, str b):
    {
        if (verbose)
        {
            println("[Ambiguity] #<a>, \"<b>\"");
            Tree tt = parse(begin, input, allowAmbiguity=true) ;
            iprintln(tt);
            list[Message] m = diagnose(tt) ;
            println( ToString(m) );
        }
        fail;
    }
    return true;
}

bool parseReject( type[&T<:Tree] begin, str input )
{
    try
    {
        parse(begin, input, allowAmbiguity=false);
    }
    catch ParseError(loc _):
    {
        return true;
    }
    return false;
}

str ToString( list[Message] msgs ) =
    ( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..]  );

str ToString( Message msg)
{
    switch(msg)
    {
        case error(str s, loc _): return "error: " + s;
        case warning(str s, loc _): return "warning: " + s;
        case info(str s, loc _): return "info: " + s;
    }
    return "";
}

Solution

  • Excellent questions.

    TL;DR:

    • the rule priority mechanism is not capable of an algorithmic ordering of a non-terminal's alternatives. Although some kind of partial order is involved in the additional grammatical constraints that a priority declaration generates, there is no "trying" one rule first, before the other. So it simply can't do that. The good news is that the priority mechanism has a formal semantics independent of any parsing algorithm, it's just defined in terms of context-free grammar rules and reduction traces.
    • using ambiguous rules for error recovery or "robust parsing", is a good idea. However, if there are too many such rules, the parser will eventually start showing quadratic or even cubic behavior, and tree building after parsing might even have higher polynomials. I believe the generated parser algorithm should have a (parameterized) mode for error recovery rather then expressing this at the grammar level.
    • Accepting ambiguity at parse time, and filtering/choosing trees after parsing is the recommended way to go.
    • All this talk of "ordering" in the documentation is misleading. Disambiguation is minefield of confusing terminology. For now, I recommend this SLE paper which has some definitions: https://homepages.cwi.nl/~jurgenv/papers/SLE2013-1.pdf

    Details

    priority mechanism not capable of choosing among alternatives

    The use of the > operator and left, right generates a partial order between mutually recursive rules, such as found in expression languages, and limited to specific item positions in each rule: namely the left-most and right-most recursive positions which overlap. Rules which are lower in the hierarchy are not allowed to be grammatically expanded as "children" of rules which are higher in the hierarchy. So in E "*" E, neither E may be expaned to E "+" E if E "*" E > E "+" E.

    The additional constraints do not choose for any E which alternative to try first. No they simply disallow certain expansions, assuming the other expansion is still valid and thus the ambiguity is solved.

    The reason for the limitation at specific positions is that for these positions the parser generator can "prove" that they will generate ambiguity, and thus filtering one of the two alternatives by disallowing certain nestings will not result in additional parse errors. (consider a rule for array indexing: E "[" E "]" which should not have additional constraints for the second E. This is a so-called "syntax-safe" disambiguation mechanism.

    All and all it is a pretty weak mechanism algorithmically, and specifically tailored for mutually recursive combinator/expression-like languages. The end-goal of the mechanism is to make sure we use have to use only 1 non-terminal for the entire expression language, and the parse trees looking very much akin in shape to abstract syntax trees. Rascal inherited all these considerations from SDF, via SDF2, by the way.

    Current implementations actually "factor" the grammar or the parse table in some fashion invisibly to get the same effect, as-if somebody would have factored the grammar completely; however these implementations under-the-hood are very specific to the parsing algorithm in question. the GLR version is quite different from the GLL version, which again is quite different from the DataDependent version.

    Post-parse filtering

    Of course any tree, including ambiguous parse forests produced by the parser, can be manipulated by Rascal programs using pattern matching, visit, etc. You could write any algorithm to remove the trees you want. However, this requires the entire forest to be constructed first. It's possible and often fast enough, but there is a faster alternative.

    Since the tree is built in a bottom-up fashion from the parse graph after parsing, we can also apply "rewrite rules" during the construction of the tree, and remove certain alternatives.

    For example:

    Tree amb({Tree a, *Tree others}) = amb(others) when weDoNotWant(a);
    Tree amb({Tree a}) = a;  
    

    This first rule would match on the ambiguity cluster for all trees, and remove all alternatives which weDoNotWant. The second rule removes the cluster if only one alternative is left and let's the last tree "win".

    If you want to choose among alternatives:

    Tree amb({Tree a, Tree b, *Tree others}) = amb({a, others} when weFindPeferable(a, b);
    

    If you don't want to use Tree but a more specific non-terminal like Statement that should also work.

    This example module uses @prefer tags in syntax definitions to "prefer" rules which have been tagged over the other rules, as post-parse rewrite rules:

    https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/sdf2/filters/PreferAvoid.rsc

    Hacking around with additional lexical constraints

    Next to priority disambiguation and post-parse rewriting, we still have the lexical level disambiguation mechanisms in the toolkit:

    • `NT \ Keywords" - rejecting finite (keyword) languages from a non-terminals
    • CC << NT, NT >> CC, CC !<< NT, NT !>> CC follow and preceede restrictions (where CC stands for character-class and NT for non-terminal)

    Solving other kinds of ambiguity apart from the operator precedence stuff can be tried with these, in particular if the length of different sub-sentences is shorter/longer between the different alternatives, !>> can do the "maximal munch" or "longest match" thing. So I was thinking out loud:

    lexical C = A? B?;
    

    where A is one lexical alternative and B is the other. With the proper !>> restrictions on A and !<< restrictions on B the grammar might be tricked into always wanting to put all characters in A, unless they don't fit into A as a language, in which case they would default to B.

    The obvious/annoying advice

    Think harder about an unambiguous and simpler grammar.

    Sometimes this means to abstract and allow more sentences in the grammar, avoiding use of the grammar for "type checking" the tree. It's often better to over-approximate the syntax of the language and then use (static) semantic analysis (over simpler trees) to get what you want, rather then staring at a complex ambiguous grammar.

    A typical example: C blocks with declarations only at the start are much harder to define unambiguously then C blocks where declarations are allowed everywhere. And for a C90 mode, all you have to do is flag declarations which are not at the start of a block.

    This particular example

    lexical WordChar = [0-9A-Za-z] ;
    lexical Digit = [0-9] ;
    lexical WordInitialDigit = Digit WordChar* !>> WordChar;
    lexical WordAny = WordChar+ !>> WordChar;
    syntax Word =
        WordInitialDigit
        | [0-9] !<< WordAny // this would help!
        ;
    

    wrap up

    Great question, thanks for the patience. Hope this helps!