
Reporting parse errors from PLY to caller of parser


So I've implemented a parser using PLY — but all the PLY documentation deals with parse and tokenization errors by printing out error messages. I'm wondering what the best way to implement non-fatal error-reporting is, at an API level, to the caller of the parser. Obviously the "non-fatal" restriction means exceptions are out — and it feels like I'd be misusing the warnings module for parse errors. Suggestions?


Solution

  • PLY has a t_error() function that you define in your lexer to do whatever you want. The example in the documentation prints an error message and skips the offending character, but you could just as easily append to a list of encountered failures, stop once a threshold number of failures is reached, etc. (For grammar errors, ply.yacc calls the analogous p_error() function.) See http://www.dabeaz.com/ply/ply.html

    4.9 Error handling

    Finally, the t_error() function is used to handle lexing errors that occur when illegal characters are detected. In this case, the t.value attribute contains the rest of the input string that has not been tokenized. In the example, the error function was defined as follows:

    # Error handling rule
    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)
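
As a rough sketch of the threshold idea mentioned above: the lexer below records each illegal character and, once a limit is reached, skips the rest of the input so that tokenizing ends quietly instead of raising. The class name LimitedErrorLexer, the max_errors parameter and the tokenize() method are made up for illustration; only lex.lex(), input(), token() and skip() are PLY calls. It relies on t.value holding the remaining input, as the documentation quoted above describes.

```python
import ply.lex as lex


class LimitedErrorLexer:
    """Hypothetical lexer that records errors and gives up after max_errors."""

    tokens = ('NUMBER', 'PLUS')

    t_PLUS = r'\+'
    t_ignore = ' \t'

    def __init__(self, max_errors=3):
        self.max_errors = max_errors   # illustrative threshold, not a PLY option
        self.errors = []
        self.lexer = lex.lex(module=self)

    def t_NUMBER(self, t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(self, t):
        # Record where the problem is and what the offending character was.
        self.errors.append((t.lexpos, "Illegal character '%s'" % t.value[0]))
        if len(self.errors) >= self.max_errors:
            # t.value is the untokenized remainder, so this jumps to the end
            # of the input and token() will return None on the next call.
            t.lexer.skip(len(t.value))
        else:
            t.lexer.skip(1)

    def tokenize(self, data):
        """Return the tokens found; self.errors holds what went wrong."""
        self.errors = []
        self.lexer.input(data)
        found = []
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            found.append(tok)
        return found
```

With max_errors=2, tokenizing "1 + a + b + c + 2" should stop at the second illegal character and leave two entries in errors, each carrying the position so the caller can point at the input.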
    

    You can use this by making your lexer a class and storing the error state on it. This is a very crude example: if you wanted multiple lexers running concurrently, you would have to create multiple MyLexer instances, build() each one, and then use each of them for tokenizing.

    You could tie the error storage to the lexer instance itself (for example, keyed on its __hash__) so that you only have to build once. I'm hazy on the details of running multiple lexer instances within one class, but this is only meant as a rough example of how you can capture and report non-fatal errors.
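
One way to build only once, offered as a sketch under assumptions rather than a recommended design: keep the error list as an attribute on the lexer object itself and give each input its own clone() of a single built lexer. PLY's documentation describes both storing state on the lexer object and cloning lexers; the tokenize() helper, the _master name and the errors attribute are hypothetical.

```python
import ply.lex as lex

tokens = ('NUMBER', 'PLUS')

t_PLUS = r'\+'
t_ignore = ' \t'


def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t


def t_error(t):
    # t.lexer is whichever lexer (clone) hit the error, so each clone
    # fills its own list. "errors" is an attribute we attach ourselves.
    t.lexer.errors.append("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)


# Build the lexer rules once.
_master = lex.lex()


def tokenize(data):
    """Hypothetical helper: return (tokens, errors) for one input string."""
    lexer = _master.clone()   # copy that reuses the already-built rules
    lexer.errors = []         # fresh error list for this clone only
    lexer.input(data)
    found = []
    while True:
        tok = lexer.token()
        if not tok:
            break
        found.append(tok)
    return found, lexer.errors
```

Because every call works on its own clone, two inputs never see each other's errors, and the caller gets them back as a plain return value.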

    I've modified the simple calculator class example from PLY's documentation for this purpose.

    #!/usr/bin/python
    
    import ply.lex as lex
    
    class MyLexer:
    
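        def __init__(self):
            # Per-instance error list (a class-level list would be shared by all instances)
            self.errors = []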
    
        # List of token names.   This is always required
        tokens = (
           'NUMBER',
           'PLUS',
           'MINUS',
           'TIMES',
           'DIVIDE',
           'LPAREN',
           'RPAREN',
        )
    
        # Regular expression rules for simple tokens
        t_PLUS    = r'\+'
        t_MINUS   = r'-'
        t_TIMES   = r'\*'
        t_DIVIDE  = r'/'
        t_LPAREN  = r'\('
        t_RPAREN  = r'\)'
    
        # A regular expression rule with some action code
        # Note addition of self parameter since we're in a class
        def t_NUMBER(self,t):
            r'\d+'
            t.value = int(t.value)
            return t
    
        # Define a rule so we can track line numbers
        def t_newline(self,t):
            r'\n+'
            t.lexer.lineno += len(t.value)
    
        # A string containing ignored characters (spaces and tabs)
        t_ignore  = ' \t'
    
        # Error handling rule
        def t_error(self,t):
            self.errors.append("Illegal character '%s'" % t.value[0])
            t.lexer.skip(1)
    
        # Build the lexer
        def build(self,**kwargs):
            self.errors = []
            self.lexer = lex.lex(module=self, **kwargs)
    
        # Tokenize the input and print each token
        def test(self,data):
            self.errors = []
            self.lexer.input(data)
            while True:
                tok = self.lexer.token()
                if not tok:
                    break
                print(tok)
    
        # Return the errors collected during the last test() call
        def report(self):
            return self.errors
    

    Usage:

    # Build the lexer and try it out
    m = MyLexer()
    m.build()               # Build the lexer
    m.test("3 + 4 + 5")     # Test it
    print(m.report())
    m.test("3 + A + B")
    print(m.report())
    

    Output:

    LexToken(NUMBER,3,1,0)
    LexToken(PLUS,'+',1,2)
    LexToken(NUMBER,4,1,4)
    LexToken(PLUS,'+',1,6)
    LexToken(NUMBER,5,1,8)
    []
    LexToken(NUMBER,3,1,0)
    LexToken(PLUS,'+',1,2)
    LexToken(PLUS,'+',1,6)
    ["Illegal character 'A'", "Illegal character 'B'"]
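
The example above covers tokenizing errors only. For grammar errors, ply.yacc calls a p_error() function in the same spirit, so the same list-collecting approach can be applied there and the errors handed back next to the result. This is a minimal sketch under assumptions: the CalcParser class, its tiny addition-only grammar and the (result, errors) return convention are invented for illustration, and no error recovery rules are defined, so the result that comes back after a syntax error should not be trusted.

```python
import ply.lex as lex
import ply.yacc as yacc


class CalcParser:
    """Hypothetical parser that reports errors through its return value."""

    tokens = ('NUMBER', 'PLUS')

    t_PLUS = r'\+'
    t_ignore = ' \t'

    def __init__(self):
        self.errors = []
        self.lexer = lex.lex(module=self)
        # debug=False keeps PLY from writing a parser.out debug file.
        self.parser = yacc.yacc(module=self, debug=False)

    def t_NUMBER(self, t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(self, t):
        # Lexing errors go into the same list as syntax errors.
        self.errors.append("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    def p_expr_plus(self, p):
        'expr : expr PLUS NUMBER'
        p[0] = p[1] + p[3]

    def p_expr_number(self, p):
        'expr : NUMBER'
        p[0] = p[1]

    def p_error(self, p):
        # PLY passes None when the input ends unexpectedly.
        if p is None:
            self.errors.append("Unexpected end of input")
        else:
            self.errors.append(
                "Syntax error at %r (position %d)" % (p.value, p.lexpos))

    def parse(self, text):
        """Return (result, errors); errors is empty when parsing succeeded."""
        self.errors = []
        result = self.parser.parse(text, lexer=self.lexer)
        return result, list(self.errors)
```

Usage would look like result, errors = CalcParser().parse("1 + 2 + 3"). The caller checks whether errors is empty before using result, which keeps the reporting non-fatal without involving exceptions or the warnings module.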