Search code examples
javaparsingcup

Java CUP (Parser) produces shift/reduce conflict when Type of variable or function is user defined


My grammer needs to have user defined Type ID combinations. The problem with the code below is that it generates the following:

 [java] Warning : *** Shift/Reduce conflict found in state #67
 [java]   between VariableDecls ::= (*) 
 [java]   and     Type ::= (*) ID 
 [java]   under symbol ID
 [java]   Resolved in favor of shifting.
 [java] Warning : *** Shift/Reduce conflict found in state #65
 [java]   between VariableDecls ::= (*) 
 [java]   and     Type ::= (*) ID 
 [java]   under symbol ID
 [java]   Resolved in favor of shifting.
 [java] Error : *** More conflicts encountered than expected -- parser generation aborted
 [java] ------- CUP v0.11b 20160615 (GIT 4ac7450) Parser Generation Summary -------
 [java]   1 error and 4 warnings
 [java]   51 terminals, 34 non-terminals, and 93 productions declared, 
 [java]   producing 190 unique parse states.
 [java]   2 terminals declared but not used.
 [java]   0 non-terminals declared but not used.
 [java]   0 productions never reduced.
 [java]   2 conflicts detected (0 expected).
 [java]   No code produced.

I have tried to get rid of E productions in my grammer in both VariableDecls and Stmts productions with no luck. I have also tried to produce ID with type, and then produce e production. package cup.example;

import java_cup.runtime.*;
import java.io.IOException;
import java.io.File;
import java.io.FileInputStream;

parser code {:
  TScanner scanner;
  Parser(TScanner scanner) { this.scanner = scanner; }
    public void syntax_error(Symbol cur_token) {
        System.out.println("WE ARE HERE");
        done_parsing();
    }
    public void unrecovered_syntax_error(Symbol cur_token) {
        System.out.println(cur_token.sym);
        System.out.println("[reject]");
    }
:}


scan with {: return scanner.next_token(); :};

/* Terminals (tokens returned by the scanner). */
terminal    BOOLN, DBL, _INT, STRING, NUL;
terminal    _IF, ELS, FR, WHLE;
terminal    INTCONST, DBLCONST, STRINGCONST, BOOLCONST;
terminal    ADDOP, SUBOP, MULOP, DIV, MOD;
terminal    LFTPRN, RTPRN, LFTBRACKET, RTBRC, LFTBRACE, RTBRACE;
terminal    LESS, LESSEQ, GRT, GRTEQ, EQL, NEQ;
terminal    AND, OR, NOT;
terminal    ASSIGN, SEMICOL, COMMA, DOT;
terminal    BRK, CLS, EXTNDS, IMPL, INTRFC, NEWAR;
terminal    PRNTLN, READLN, RTRN, _VOID, NW;
terminal    ID;

/* Non terminals */
non terminal    Program, Decls, Decl;
non terminal    VariableDecl, FunctionDecl, ClassDecl, InterfaceDecl;
non terminal    Variable, Type, Formals, Variables, Extends, Implements, Implement;
non terminal    Field, Fields, Prototype, StmtBlock, VariableDecls, Stmts, Stmt;
non terminal    OptionExpr, WhileStmt, ForStmt, BreakStmt;
non terminal    ReturnStmt, PrintStmt, Expr, Exprs, Lvalue, Call, Actuals, Constant;
non terminal    IfStmt;

/* Precedences */
precedence right ASSIGN;
precedence left OR;
precedence left AND;
precedence left EQL, NEQ;
precedence left LESS, LESSEQ, GRT, GRTEQ;
precedence left ADDOP, SUBOP;
precedence left MULOP, DIV, MOD;
precedence left NOT;
precedence left LFTBRACKET, DOT;
precedence left ELS;

/* Toy grammar */

start with Program;

Program ::=     
    Decls 
        {:  System.out.print("[reduce 1]"); System.out.print("[accept]"); done_parsing(); :};

Decls ::= 
    Decl
        {: System.out.print("[reduce 2]"); :} 
    | Decl Decls
        {: System.out.print("[reduce 3]"); :} ;

Decl ::=
    VariableDecl    
        {: System.out.print("[reduce 4]"); :} 
    | FunctionDecl  
        {: System.out.print("[reduce 5]"); :} 
    | ClassDecl         
        {: System.out.print("[reduce 6]"); :} 
    | InterfaceDecl 
        {: System.out.print("[reduce 7]"); :} ;

VariableDecl ::=
    Variable SEMICOL    
        {: System.out.print("[reduce 8]"); :} ;

Variable ::=
    Type ID
        {: System.out.print("[reduce 9]"); :} ;

Type ::=
    _INT 
        {: System.out.print("[reduce 10]"); :} 
    | DBL
        {: System.out.print("[reduce 11]"); :} 
    | BOOLN
        {: System.out.print("[reduce 12]"); :} 
    | STRING
        {: System.out.print("[reduce 13]"); :} 
    | Type LFTBRACKET RTBRC     
        {: System.out.print("[reduce 14]"); :}

    |  ID {: System.out.print("[reduce 15]"); :};


FunctionDecl ::= 
    Type ID LFTPRN Formals RTPRN StmtBlock 
        {: System.out.print("[reduce 16]"); :} 
    | _VOID ID LFTPRN Formals RTPRN StmtBlock
        {: System.out.print("[reduce 17]"); :} ;

Formals ::=
    // EMPTY
        {: System.out.print("[reduce 18]"); :} 
    | Variables
        {: System.out.print("[reduce 19]"); :} ;

Variables ::= 
    Variable
        {: System.out.print("[reduce 20]"); :}  
    | Variable COMMA Variables  
        {: System.out.print("[reduce 21]"); :} ;

ClassDecl ::= 
    CLS ID Extends Implements LFTBRACE Fields RTBRACE
        {: System.out.print("[reduce 22]"); :} ;

Extends ::=
    // EMPTY
        {: System.out.print("[reduce 23]"); :}
    | EXTNDS ID
        {: System.out.print("[reduce 24]"); :};

Implements ::= 
    // EMPTY
        {: System.out.print("[reduce 25]"); :}
    | Implement
        {: System.out.print("[reduce 26]"); :};

Implement ::= 
    IMPL ID 
        {: System.out.print("[reduce 27]"); :}
    | IMPL ID COMMA Implement
        {: System.out.print("[reduce 28]"); :};

Fields ::=  
    // EMPTY
        {: System.out.print("[reduce 29]"); :}
    | Field Fields
        {: System.out.print("[reduce 30]"); :};

Field ::= 
    VariableDecl
        {: System.out.print("[reduce 31]"); :} 
    | FunctionDecl  
        {: System.out.print("[reduce 32]"); :};

InterfaceDecl ::= 
    INTRFC ID LFTBRACE Prototype RTBRACE    
        {: System.out.print("[reduce 33]"); :};

Prototype ::=
    // EMPTY
        {: System.out.print("[reduce 34]"); :}
    | Type ID LFTPRN Formals RTPRN SEMICOL Prototype 
        {: System.out.print("[reduce 35]"); :}
    | _VOID ID LFTPRN Formals RTPRN SEMICOL Prototype
        {: System.out.print("[reduce 36]"); :};

StmtBlock ::= 
    LFTBRACE VariableDecls Stmts RTBRACE    
        {: System.out.print("[reduce 37]"); :};

VariableDecls ::=
        //EMPTY
        {:System.out.print("[reduce 38]"); :}
        |
         VariableDecl VariableDecls
        {: System.out.print("[reduce 39]"); :};

Stmts ::=
    // EMPTY
        {: System.out.print("[reduce 40]"); :}
    | Stmt Stmts
        {: System.out.print("[reduce 41]"); :};

Stmt ::= 
    OptionExpr SEMICOL 
        {: System.out.print("[reduce 42]"); :}
    | IfStmt 
        {: System.out.print("[reduce 43]"); :}
    | WhileStmt 
        {: System.out.print("[reduce 44]"); :}
    | ForStmt   
        {: System.out.print("[reduce 45]"); :}
    | BreakStmt
        {: System.out.print("[reduce 46]"); :}
    | ReturnStmt 
        {: System.out.print("[reduce 47]"); :}
    | PrintStmt 
        {: System.out.print("[reduce 48]"); :}
    | StmtBlock
        {: System.out.print("[reduce 49]"); :};

IfStmt ::= 
    _IF LFTPRN Expr RTPRN Stmt  
        {: System.out.print("[reduce 50]"); :} 
    |  _IF LFTPRN Expr RTPRN Stmt ELS Stmt  
        {: System.out.print("[reduce 51]"); :}; 

WhileStmt ::=
    WHLE LFTPRN Expr RTPRN Stmt
        {: System.out.print("[reduce 52]"); :};

ForStmt ::=
    FR LFTPRN OptionExpr SEMICOL Expr SEMICOL OptionExpr RTPRN Stmt 
        {: System.out.print("[reduce 53]"); :};

BreakStmt ::= 
    BRK SEMICOL
        {: System.out.print("[reduce 54]"); :};

ReturnStmt ::= 
    RTRN OptionExpr SEMICOL     
        {: System.out.print("[reduce 55]"); :};

PrintStmt ::= 
    PRNTLN LFTPRN Exprs RTPRN SEMICOL   
        {: System.out.print("[reduce 56]"); :};

Expr ::= 
    Lvalue ASSIGN Expr 
        {: System.out.print("[reduce 57]"); :}
    | Constant 
        {: System.out.print("[reduce 58]"); :}
    | Lvalue
        {: System.out.print("[reduce 59]"); :}
    | Call 
        {: System.out.print("[reduce 60]"); :}
    | LFTPRN Expr RTPRN 
        {: System.out.print("[reduce 61]"); :}
    | Expr ADDOP Expr   
        {: System.out.print("[reduce 62]"); :}
    | Expr SUBOP Expr 
        {: System.out.print("[reduce 63]"); :}
    | Expr MULOP Expr 
        {: System.out.print("[reduce 64]"); :}
    | Expr DIV Expr 
        {: System.out.print("[reduce 65]"); :}
    | Expr MOD Expr     
        {: System.out.print("[reduce 66]"); :}
    | Expr LESS Expr    
        {: System.out.print("[reduce 68]"); :}
    | Expr LESSEQ Expr
        {: System.out.print("[reduce 69]"); :}
    | Expr GRT Expr 
        {: System.out.print("[reduce 70]"); :}
    | Expr GRTEQ Expr 
        {: System.out.print("[reduce 71]"); :}
    | Expr EQL Expr 
        {: System.out.print("[reduce 72]"); :}
    | Expr NEQ Expr 
        {: System.out.print("[reduce 73]"); :}
    | Expr AND Expr 
        {: System.out.print("[reduce 74]"); :}
    | Expr OR Expr 
        {: System.out.print("[reduce 75]"); :}
    | NOT Expr 
        {: System.out.print("[reduce 76]"); :}
    | READLN LFTPRN RTPRN 
        {: System.out.print("[reduce 77]"); :}
    | NEWAR LFTPRN INTCONST COMMA Type RTPRN
        {: System.out.print("[reduce 78]"); :};

Lvalue ::= 
    ID
        {: System.out.print("[reduce 79]"); :}
    | Lvalue LFTBRACKET Expr RTBRC 
        {: System.out.print("[reduce 80]"); :}
    | Lvalue DOT ID
        {: System.out.print("[reduce 81]"); :};

Call ::= 
    ID LFTPRN Actuals RTPRN 
        {: System.out.print("[reduce 82]"); :}
    | ID DOT ID LFTPRN Actuals RTPRN
        {: System.out.print("[reduce 83]"); :};

Actuals ::=
    // EMPTY
        {: System.out.print("[reduce 84]"); :} 
    | Exprs 
        {: System.out.print("[reduce 85]"); :};

Exprs ::= 
    Expr
        {: System.out.print("[reduce 86]"); :}
    | Expr COMMA Exprs
        {: System.out.print("[reduce 87]"); :};

Constant ::= 
    INTCONST
        {: System.out.print("[reduce 88]"); :}
    | DBLCONST
        {: System.out.print("[reduce 89]"); :}
    | STRINGCONST 
        {: System.out.print("[reduce 90]"); :}
    | BOOLCONST
        {: System.out.print("[reduce 91]"); :};

OptionExpr ::= 
    //EMPTY
        {: System.out.print("[reduce 92]"); :}
    | Expr
        {: System.out.print("[reduce 93]"); :};

Solution

  • I presume that this is some variant of the popular "Decaf" language, often used in introductory CS courses.

    It's not really clear to me why CUP only reports two conflicts, since afaics there are four conflicts in your grammar. Perhaps the version you pasted is not the version which generated the error message you included in your question.

    The conflicts reported in the error message are the result of your use of right-recursion for both the list of variable declarations and the list of statements which make up a statement block.

    Conventional wisdom will tell you that right-recursion should be avoided whenever possible, because it uses an unbounded amount of parser stack. Left recursion, by contrast, uses a constant amount of parser stack. That's a good rule of thumb, but most of the time the choice between left- and right-recursion will be dictated by the syntax. So, for example, if you are writing a grammar for arithmetic expressions without using precedence declarations, you will use left recursion for left-associative operators (which is almost all of them) and right recursion for right-recursive operators (such as assignment operators in C, C++ and Java).

    Item lists can usually be written either way, since they will generally be collapsed into a vector rather than staying as a binary tree, so the normal case will be left recursion:

    x_list ::= x_list x_element |
               // EMPTY
               ;
    

    You'll also see a number of variants of the above pattern. If the list of itmes cannot be empty, for example, the first production would be x_list: x_element. You'd also have to make modifications if the elements needed to be followed by or separated by a token, so you'll often seen things like:

    // In the language this comes from, statements are *always* followed by
    // a semicolon. Most languages don't work that way, though.
    statement_list ::= statement_list statement T_SEMICOLON |
                       // EMPTY
                       ;
    
    // Parameter lists (and argument lists) are either empty or have the items
    // *separated* by commas. Because the comma is only present if there are at
    // least two items, we need to special-case the empty list:
    
    parameter_list ::= T_OPAREN T_CPAREN |
                       T_OPAREN parameters T_CPAREN
                       ;
    parameters     ::= parameter |
                       parameters T_COMMA parameter
                       ;
    

    Although I wrote those all as left-recursion, I could just as well have used right-recursion in these particular cases. But there is a subtle difference between the left-recursive parse and the right-recursive parse, which also affects the order of execution of parser actions. Consider the difference between:

    id_list ::= id_list ID |                id_list ::= ID id_list |
               // EMPTY                                 // EMPTY
               ;                                        ;
    

    Both of them accept a b c, but they accept them in different ways: ε •

               •3                               •3
              / \                              / \
             •2   c                            a  •2
            / \                                  / \
           •1   b                                b  •1
          / \                                      / \ 
         ε   a                                    c   ε
    

    Because the parser is bottom-up in both cases, the parser actions always execute starting at the bottom. That will cause the first (left-recursive) parser to execute actions in input order, while the second one will execute actions from right to left.

    Anyway, getting back to the problem. You have, in effect, the following grammar which will derive a possibly-empty sequence of declarations followed by a possibly-empty sequence of statements:

    StatementBody        ::= OBRACE VariableDeclarations Statements CBRACE 
    VariableDeclarations ::= VariableDeclaration VariableDelarations     | // EMPTY
    Statements           ::= Statement Statements                        | // EMPTY
    

    Taking into account the parse tree derived above, both Statements and Declarations need to effectively end with an empty production. In other words, before the parser can shift the first token in Statements, it needs to reduce an empty VariableDeclarations non-terminal. And that means that it needs to know exactly which token will be the first token in Statements.

    Unfortunately, that's not possible because both Statement and VariableDeclaration can start with ID. So if the parser has just gotten to the end of a VariableDeclaration and the lookahead token is ID, it cannot tell whether to switch to parsing Statements or continue to parse VariableDeclarations.

    Note that the situation would not improve if we changed both these lists to left-recursion, because in that case the parser would have to reduce an empty Statements non-terminal at exactly the same point. The only way to avoid making the parser guess where to insert an empty non-terminal is to put both empty non-terminals at the ends of the StatementBody. In other words, VariableDeclarations must be left-recursive, so that the empty VariableDeclarations is at the beginning, while Statements must be right-recursive, so that the empty Statements is at the end:

    StatementBody        ::= OBRACE VariableDeclarations Statements CBRACE 
    VariableDeclarations ::= VariableDeclarations VariableDelaration     | // EMPTY
    Statements           ::= Statement Statements                        | // EMPTY
    

    That won't quite work, though, because (for other reasons) the parser has to be able to know whether it is parsing a Statement or a VariableDeclaration starting with an ID by looking at the token immediately following the ID. And there it will run into the following non-determinism:

        b [ ] a;      // Declaration
        b [ 3 ] = a;  // Assignment
    

    These two possibilities can't be distinguished until the third token is seen. But the parser needs to know one token earlier, so that it can decide whether to turn b into an Lvalue.

    Resolving this conflict is more annoying. I believe the usual approach is to force the lexical scanner to do the work, by returning [ ] as a single token. That resolves the problem, to be sure -- with that change, a single open bracket always means that the parser is looking at an expression, while a [ ] pair always means a declaration. But it's awkward in the scanner; in particular, the scanner will need to be able to handle something like

    [ /* A comment */
      /* Another comment */ ]
    

    as a single [ ] token. (No-one would write such code, we hope, but it is legal.)

    And this leads us to the third shift-reduce conflict, which is the result of distinguishing between dotted assignments and dotted calls:

    a . b ( 3 ) ;
    a . b = 3 ;
    

    This is a much simpler problem though, and it can be solved by looking carefully at the reference grammar for Decaf. With your grammar, the call needs to match a production ID DOT ID OPAREN ..., while the assignment will be matching Lvalue DOT ID. In other words, when the DOT is the lookahead, the parser needs to decide whether or not to reduce a to Lvalue. That can be avoided by making the two right-hand sides more similar.