Search code examples
pythonyacclexbnfply

Syntax error using empty production rule in python PLY(lex/yacc)


The full example is given here:

import ply.lex as lex
import Property
# List of token names.   This is always required
tokens = [
    'CheckupInformation',
    'Introduction',
    'Information',
    'perfect',
    'sick',
    'LPAREN',
    'RPAREN',
    'CHAR',
    'NUMBER'
    ] 
def t_CheckupInformation(t)     : 'CheckupInformation'     ; return t
def t_Introduction(t)  : 'Introduction'  ; return t
def t_Information(t) : 'Information' ; return t
def t_perfect(t): 'perfect'; return t
def t_sick(t) : 'sick'; return t



t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_CHAR = r'[a-zA-Z_][a-zA-Z0-9_\-]*'
t_ignore = " \t"
# Define a rule so we can track line numbers

def t_NUMBER(t):
    r'[+\-0-9_][0-9_]*'
    t.lexer.lineno += len(t.value)
    try:
        t.value = int(t.value)
    except ValueError:
        print("Integer value too large %s" % t.value)
        t.value = 0
    return t
def t_SEMICOLON(t):
    r'\;.*'
    t.lexer.lineno += len(t.value)
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

 # Build the lexer
lexer = lex.lex()
# define upper level classes first     
class stat:
    def __init__(self):
        self.statement = ""
        self.intro = list()
        self.body = list()


P=stat()
def p_stat(p):
    'Stat : LPAREN CheckupInformation statIntro statBody RPAREN'
    p[0]=(p[1],p[2],p[3],p[4],p[5])

def p_Intro(p) : 
    '''statIntro : LPAREN Introduction Name RPAREN
                 | statIntro LPAREN Introduction Name RPAREN
                 | empty'''

    if len(p)==5:
       p[0] = (p[3])
    elif len(p)==6:
       p[0] = (p[4])
    else:
       p[0]= None
    P.intro.append(p[0])

def p_Name(p):
    'Name : CHAR'
    p[0]=p[1]



def p_Body(p):
    '''statBody : LPAREN Information bodyinfo RPAREN
                | statBody LPAREN Information bodyinfo RPAREN'''
    if len(p)==5:
       p[0] = (p[3])
    elif len(p)==6:
       p[0] = (p[4])
    P.body.append(p[0])
def p_bodyinfo(p):
    '''bodyinfo : LPAREN CHAR perfect RPAREN
                | LPAREN CHAR sick RPAREN'''
    p[0]=p[2],p[3]


def p_empty(p):
    'empty :  '
    print("This function is called")
    pass   
def p_error(p):
    print("Syntax error in input '%s'!" % p.value)

import ply.yacc as yacc
parser = yacc.yacc()
import sys
if len(sys.argv) < 2 :
    sys.exit("Usage: %s <filename>" % sys.argv[0])
fp = open(sys.argv[1])
contents=fp.read()
result=parser.parse(contents)

print("(CheckupInformation")
if (P.intro) != None:
    for x in range(len(P.intro)):
        print("    (Introduction %s)" %(P.intro[x]))
for x in range(len(P.body)):
        print("    (Information( %s %s))" %(P.body[x]))
print(")")

The text File1 is:

(CheckupInformation
  (Introduction John)
  (Introduction Patt)
  (Information(Anonymous1 perfect))
  (Information(Anonymous2 sick))
)

The text File2 is:

(CheckupInformation
  (Information(Anonymous1 perfect))
  (Information(Anonymous2 sick))
)

According to my bnf grammar 'Intro' is optional. The upper code with the text file1 works well. But when I remove the 'Intro' part from text file as text file2, it gives syntax error at 'body' section that means it cannot handle empty production. Please help me how to solve the error. How to handle the empty production rule for my code?

Error message: error message snip


Solution

  • Your program cannot be run because you import Property, which is not a standard library module. But deleting that line is at least sufficient to get to the point where Ply attempts to build the parser, at which point it generates several warnings, including a shift/reduce conflict warning. This last warning is important; you should attempt to fix it, and you certainly should not ignore it. (Which means that you should report it as part of your question.) It is this conflict which is preventing your parser from working.

    Here's what it says:

    WARNING: Token 'NUMBER' defined, but not used
    WARNING: There is 1 unused token
    Generating LALR tables
    WARNING: 1 shift/reduce conflict
    

    Ply generates the file parser.out, which includes more information about your grammar, including a detailed description of the shift/reduce conflict. Examining that file, we find the following:

    state 3
    
        (1) Stat -> LPAREN CheckupInformation . statIntro statBody RPAREN
        (2) statIntro -> . LPAREN Introduction Name RPAREN
        (3) statIntro -> . statIntro LPAREN Introduction Name RPAREN
        (4) statIntro -> . empty
        (10) empty -> .
    
      ! shift/reduce conflict for LPAREN resolved as shift
        LPAREN          shift and go to state 4
    
      ! LPAREN          [ reduce using rule 10 (empty -> .) ]
    
        statIntro                      shift and go to state 5
        empty                          shift and go to state 6
    

    The parser enters State 3 when it is at this point in the processing of Stat:

    Stat -> LPAREN CheckupInformation . statIntro statBody RPAREN
    

    The dot between CheckupInformation and statIntro indicates the progress. There might be more than one production in a state with dots in the middle, which means that the parser has not yet had to figure out which of those alternatives to pick. There are also productions with the dot at the beginning; these will correspond to the non-terminal(s) which immediately follow the dot(s), indicating that those productions now need to be considered.

    There may also productions with the dot at the end, which indicates that at this point in the parse, the sequence of symbols encountered can be "reduced" to the corresponding non-terminal. In other words, the parser has recognised that non-terminal.

    Reductions must be performed when they are recognised, or not at all. A reduction might not be performed, if the following token -- the "lookahead token" -- cannot follow the non-terminal to be reduced. In general, the parser needs to consider the following questions, which can be immediately answered by consulting the state transition table (these are shown immediately following the productions):

    1. Can the parse progress by continuing with the next token, without performing a reduction? (This is called a "shift" action, because the parser shifts one token to the right in the active production(s).)
    2. For each possible reduction in this state, can the parse progress by performing that reduction?

    A conflict occurs if the answer to more than one of these questions is "yes". That doesn't necessarily mean that the grammar is ambiguous, but it does mean that the parser cannot decide how to choose between the two alternatives.

    Like most parser generators, Ply resolves this question using some built-in rules. One of these rules is that in the absence of other information (precedence declarations), if the answer to the first question was "yes", the parser should proceed without performing any reductions.

    In the particular example of this state, the reduction which could be made is empty :. Although it's not obvious from this state (we'd have to look at the state the parser enters after doing that reduction, which is state 6), after reducing empty, the parser's next move will necessarily be to reduce statIntro -> empty, after which it will go to State 5, which includes the production

    Stat -> LPAREN CheckupInformation statIntro . statBody RPAREN
    

    In order for that sequence to be valid, the parser needs to know that it will be able to progress, which means that the lookahead token (in this case () must be a possible input in State 5. Of course, it is because statBody can start with an open parenthesis. So the reduction could be taken.

    But statIntro could also begin with a (, so the parser does not have to do the reduction in order to progress. Given those two alternatives, Ply chooses to take the shift action, which means that it discards the possibility that statIntro could be empty and assumes that the ( belongs to a statIntro. If there is a statIntro, this is the correct choice. But if statIntro was missing, the ( belongs to statBody, and the reduction should have been taken.

    So that's the problem with your grammar. It's an indication that the grammar, as written, needs more than one token of lookahead. Unfortunately, many parser generators, including Ply, do not have a mechanism to cope with grammars which need more than one lookahead token. (If there is some limit to the amount of lookahead needed -- in this case, for example, the conflict could be resolved by looking at the next two tokens -- then it is theoretically possible to find an equivalent grammar for the same language which needs only one lookahead token. But that will have to be your responsibility, because Ply won't do it for you.)

    In this case, the solution is extremely simple. It is only necessary to remove the empty production from statIntro, and instead make it optional by providing two productions for Stat, one which has a statIntro and one which doesn't:

    def p_stat_1(p):
        'Stat : LPAREN CheckupInformation statIntro statBody RPAREN'
        p[0]=(p[1],p[2],p[3],p[4],p[5])
    
    def p_stat_2(p):
        'Stat : LPAREN CheckupInformation           statBody RPAREN'
        p[0]=(p[1],p[2],None,p[3],p[4])
    
    def p_Intro(p) :
        '''statIntro : LPAREN Introduction Name RPAREN
                     | statIntro LPAREN Introduction Name RPAREN
        '''
    

    (I also removed p_empty from the grammar.)

    This modified grammar does not produce any conflicts, and will parse your test inputs as expected:

    $ python3 learner.py learner.1
    (CheckupInformation
        (Introduction John)
        (Introduction Patt)
        (Information( Anonymous1 perfect))
        (Information( Anonymous2 sick))
    )
    $ python3 learner.py learner.2
    (CheckupInformation
        (Information( Anonymous1 perfect))
        (Information( Anonymous2 sick))
    )
    

    Postscript:

    The transformation suggested above is simple and will work in a large number of cases, not just cases where the conflict can be resolved with a lookahead of two tokens. But, as noted in a comment, it does increase the size of a grammar, particularly when productions have more than one optional component. For example, the production:

    A : optional_B optional_C optional_D
    

    would have to be expanded into seven different productions, one for each non-empty selection from B C D, and in addition each place where A was used would need to be duplicated to allow for the case where the A was empty.

    That seems like a lot, but it might be that not all of these productions are necessary. The transformation is only needed if there is an overlap between the set of terminals which can start the optional component and the set of symbols which can follow it. So, for example, if B, C and D can all start with a parenthesis but A cannot be followed by a parenthesis, then optional_D will not cause a conflict, and only B and C would need to be expanded:

    A : B C optional_D
      |   C optional_D
      | B   optional_D
      |     optional_D
    

    That requires a bit of grammatical analysis to figure out what can follow A, but in common grammars that's not too hard to do by hand.

    If that still seems like too many productions, there are a couple of other possibilities which are less general, but which might help anyway.

    First, you might decide that it doesn't really matter what order B, C and D are presented in the above production. In that case, you could replace

    A : optional_B optional_C optional_D
    

    with the not-too-complicated, somewhat more accepting alternative:

    A : 
      | A X
    X : B | C | D
    

    (or you could avoid X by writing out the alternatives individually in productions of A.)

    That allows multiple uses of B, C and D, but that appears to correspond to your grammar, in which the optional components are actually possibly empty repetitions.

    That leaves the problem of how to produce a reasonable AST, but that's fairly easy to solve, at least in the context of Ply. Here's one possible practical solution, again assuming that repetition is acceptable:

    # This solution assumes that A cannot be followed by
    # any token which might appear at the start of a component
    def p_A(p):
        """ A : A1 """
        # Create the list of lists B, C, D, as in the original
        p[0] = [ p[1]["B"], p[1]["C"], p[1]["D"] ]
    
    def p_A1_empty(p):
        """ A1 : """
        # Start with a dictionary of empty lists
        p[0] = { "A":[], "B":[], "C":[] }
    
    def p_A1_B(p):
        """ A1 : A1 B """
        p[1]["B"].append(p[2])
        p[0] = p[1]
    
    def p_A1_C(p):
        """ A1 : A1 C """
        p[1]["C"].append(p[2])
        p[0] = p[1]
    
    def p_A1_D(p):
        """ A1 : A1 D """
        p[1]["D"].append(p[2])
        p[0] = p[1]
    

    You could simplify the last three action functions if you arranged for the semantic values of B, C and D to include an indication of what they are. So if, for example, B returned ["B", value] instead of value, then you could combine the last three A1 actions into a single function:

    def p_A1_BCD(p):
        """ A1 : A1 B
               | A1 C
               | A1 D
        """
        p[1][p[2][0]].append(p[2][1])
        p[0] = p[1]
    

    If none of that is satisfactory, and all of the conflicts can be resolved with one additional lookahead token, then you can try to solve the issue in the lexer.

    For example, your language seems to entirely consist of S-expressions which start with a open parenthesis followed by some kind of keyword. So the lexical analyser could combine the open parenthesis with the following keyword into a single token. Once you do that, your optional components no longer start with the same token and the conflict vanishes. (This technique is often used to parse XML inputs, which have the same issue: if everything interesting starts with a <, then conflicts abound. But if you recognise <tag as a single token, then the conflicts disappear. In the case of XML, whitespace is not allowed between the < and the tagname; if your grammar does allow whitespace between the ( and the following keyword, your lexer patterns will become slightly more complicated. But it's still manageable.)