
Python Lex-Yacc (PLY) Error recovery at the end of input


Problem

I am trying to implement an error-tolerant parser using Python Lex-Yacc (PLY), but I have trouble using error recovery rules at the end of my input string.

How can I recover from an unexpected end of input?

Example

This example grammar produces strings of the form A END A END A END A END ...

Statement   : Expressions

Expressions : Expression Expressions
            | 

Expression  : A END

I want to perform error recovery if the END token is omitted, so that strings like A A A END or A A A will be recognized by the parser.

My approach

I added an error recovery rule to the Expression production:

Expression : A END
           | A error

This allows me to accept input such as A A A END.

But if the last END token is omitted (A A A), I still get a syntax error and cannot recover: at the end of input there is no lookahead token left, so p_error is called with None and the error token can never be shifted.


Sample PLY code

from __future__ import print_function

# Tokens
tokens = ('A', 'END')

t_A   = r'A'
t_END = r'END'
t_ignore = " "

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
import ply.lex as lex
lex.lex()

# Rules
def p_statement_expr(p):
    '''statement : expressions'''
    print("parsed:", p[1])

def p_expressions(p):
    '''expressions : expression expressions'''
    p[0] = [p[1]] + p[2]

def p_expressions_empty(p):
    '''expressions : '''
    p[0] = list()

def p_expression_phrase(p):
    '''expression : A END
                  | A error'''
    p[0] = 'A'

def p_error(p):
    if p:
        print("Syntax error at '%s'" % p.value)
    else:
        print("Syntax error at EOI")

import ply.yacc as yacc
yacc.yacc()

while 1:
    try:
        s = raw_input('query > ')   # use input() on Python 3
    except EOFError:
        break
    yacc.parse(s)

Solution

  • I am adding this as a new answer (and I know it is too late for the bounty :-( ) because it takes a very different approach. If we were using flex, this would be much easier, since it has the special <<EOF>> token that matches only at end of file. After thinking about that, I realized it is very simple to add the same functionality to PLY without any change to the original module, by using a proxy around the lexer. Python makes such proxies easy to implement thanks to the __getattr__ special method.

    I just add

    • a new token EOF that will be sent at end of input
    • a proxy around the lexer's token method that, at end of input, returns the special EOF token once and then the normal None
    • the EOF token at the end of the statement rule

    I also reversed the rule to expressions : expressions expression (left recursion) instead of expressions : expression expressions, allowing an immediate reduce after each expression.

    The code becomes:

    from __future__ import print_function
    
    # Tokens
    tokens = ('A', 'END', 'EOF')
    
    t_A   = r'A'
    t_END = r'END'
    t_ignore = " "
    
    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)
    
    # Build the lexer
    import ply.lex as lex
    
    orig_lexer = lex.lex()
    
    class ProxyLexer(object):
        def __init__(self, lexer, eoftoken):
            self.end = False
            self.lexer = lexer
            self.eof = eoftoken
        def token(self):
            tok = self.lexer.token()
            if tok is None:
                if self.end:
                    self.end = False
                else:
                    self.end = True
                    tok = lex.LexToken()
                    tok.type = self.eof
                    tok.value = None
                    tok.lexpos = self.lexer.lexpos
                    tok.lineno = self.lexer.lineno
            # print ('custom', tok)
            return tok
        def __getattr__(self, name):
            return getattr(self.lexer, name)
    
    lexer = ProxyLexer(orig_lexer, 'EOF')
    
    # Rules
    def p_statement_expr(p):
        '''statement : expressions EOF'''
        print("parsed:", p[1])
    
    def p_expressions(p):
        '''expressions : expressions expression'''
        p[0] = p[1] + [p[2]]
    
    def p_expressions_empty(p):
        '''expressions : '''
        p[0] = list()
    
    def p_expression_phrase(p):
        '''expression : A END
                      | A error'''
        p[0] = 'A'
    
    def p_error(p):
        if p:
            print("Syntax error at '%s'" % p.value)
        else:
            print("Syntax error at EOI")
    
    import ply.yacc as yacc
    parser = yacc.yacc()
    
    while 1:
        try:
            s = raw_input('query > ')   # use input() on Python 3
        except EOFError:
            break
        parser.parse(s, lexer=lexer)
    

    This way:

    • the original grammar is unchanged
    • the error recovery method remains stupidly simple and has no dependence on the rest of the grammar
    • it can be easily extended to complex parsers
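The two mechanisms the proxy combines — one-shot EOF injection and __getattr__ delegation — can be sketched without PLY at all. The following is a minimal, illustrative sketch (MiniProxy and FakeLexer are made-up names, not part of the PLY API):

```python
# Sketch of the ProxyLexer idea without PLY: wrap any object with a
# token() method so that the first time it returns None we substitute
# an EOF sentinel, and only afterwards propagate None.
class MiniProxy(object):
    def __init__(self, source, eof='EOF'):
        self._source = source
        self._eof = eof
        self._sent = False          # has the EOF sentinel been emitted?

    def token(self):
        tok = self._source.token()
        if tok is None and not self._sent:
            self._sent = True       # emit the sentinel exactly once
            return self._eof
        return tok

    def __getattr__(self, name):
        # Called only when normal lookup fails, so every attribute we
        # do not define ourselves is delegated to the wrapped lexer.
        return getattr(self._source, name)

class FakeLexer(object):
    """Stand-in for a real lexer: yields its tokens, then None forever."""
    lineno = 1

    def __init__(self, toks):
        self._it = iter(toks)

    def token(self):
        return next(self._it, None)

proxy = MiniProxy(FakeLexer(['A', 'END']))
stream = [proxy.token() for _ in range(4)]   # ['A', 'END', 'EOF', None]
```

Because __getattr__ only fires when normal attribute lookup fails, the proxy transparently exposes everything else on the wrapped lexer (lineno, lexpos, ...), which is why yacc can use it as a drop-in replacement.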