So I've implemented a parser using PLY — but all the PLY documentation deals with parse and tokenization errors by printing out error messages. I'm wondering what the best way to implement non-fatal error-reporting is, at an API level, to the caller of the parser. Obviously the "non-fatal" restriction means exceptions are out — and it feels like I'd be misusing the warnings
module for parse errors. Suggestions?
PLY has a t_error() function that you can override in your lexer to do whatever you want. The example in the documentation prints an error message and skips the offending character, but you could just as easily append to a list of encountered failures, stop after a threshold of X failures, and so on: http://www.dabeaz.com/ply/ply.html
4.9 Error handling
Finally, the t_error() function is used to handle lexing errors that occur when illegal characters are detected. In this case, the t.value attribute contains the rest of the input string that has not been tokenized. In the example, the error function was defined as follows:
# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
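On the parser side, PLY has an analogous p_error() hook for syntax errors, which you can override the same way. A minimal sketch of recording errors instead of printing them (the syntax_errors list is my own name, not part of PLY):

```python
# Sketch: a p_error() hook that records syntax errors rather than
# printing them. The syntax_errors list is an assumption of this
# example, not something PLY provides.
syntax_errors = []

def p_error(p):
    if p is None:
        # PLY calls p_error(None) on unexpected end of input
        syntax_errors.append("Unexpected end of input")
    else:
        syntax_errors.append(
            "Syntax error at %r on line %d" % (p.value, p.lineno))
```

After parsing, the caller can inspect syntax_errors instead of catching an exception.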
You can take advantage of this by making your lexer a class and storing error state on the instance. The example below is crude: if you wanted multiple lexers running concurrently, you'd have to create multiple MyLexer instances, build() each one, and then use them for parsing. You could also key the error storage off the lexer instance itself (its hash, say) so you only have to build once. I'm hazy on the details of running multiple lexer instances from one class, but this should give a rough idea of how to capture and report non-fatal errors.
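The threshold idea mentioned earlier can be sketched like this. MAX_ERRORS and the errors list are my own names; t.lexer.lexdata and t.lexer.lexpos are real attributes of a PLY lexer:

```python
# Sketch: give up after too many failures by skipping the rest of
# the input. MAX_ERRORS and the errors list are assumptions of this
# example, not part of PLY.
MAX_ERRORS = 10
errors = []

def t_error(t):
    errors.append("Illegal character '%s'" % t.value[0])
    if len(errors) >= MAX_ERRORS:
        # Skip everything that's left so the lexer stops producing tokens
        t.lexer.skip(len(t.lexer.lexdata) - t.lexer.lexpos)
    else:
        t.lexer.skip(1)
```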
I've modified the simple calculator class example from PLY's documentation for this purpose.
#!/usr/bin/python
import ply.lex as lex

class MyLexer:
    # List of token names. This is always required
    tokens = (
        'NUMBER',
        'PLUS',
        'MINUS',
        'TIMES',
        'DIVIDE',
        'LPAREN',
        'RPAREN',
    )

    # Regular expression rules for simple tokens
    t_PLUS = r'\+'
    t_MINUS = r'-'
    t_TIMES = r'\*'
    t_DIVIDE = r'/'
    t_LPAREN = r'\('
    t_RPAREN = r'\)'

    # A regular expression rule with some action code
    # Note the addition of the self parameter since we're in a class
    def t_NUMBER(self, t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self, t):
        r'\n+'
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore = ' \t'

    # Error handling rule: record the error instead of printing it
    def t_error(self, t):
        self.errors.append("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    # Build the lexer
    def build(self, **kwargs):
        self.errors = []
        self.lexer = lex.lex(module=self, **kwargs)

    # Tokenize some input and print the tokens
    def test(self, data):
        self.errors = []
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok:
                break
            print(tok)

    # Report any errors collected during the last run
    def report(self):
        return self.errors
Usage:
# Build the lexer and try it out
m = MyLexer()
m.build()            # Build the lexer
m.test("3 + 4 + 5")  # Test it
print(m.report())
m.test("3 + A + B")
print(m.report())
Output:
LexToken(NUMBER,3,1,0)
LexToken(PLUS,'+',1,2)
LexToken(NUMBER,4,1,4)
LexToken(PLUS,'+',1,6)
LexToken(NUMBER,5,1,8)
[]
LexToken(NUMBER,3,1,0)
LexToken(PLUS,'+',1,2)
LexToken(PLUS,'+',1,6)
["Illegal character 'A'", "Illegal character 'B'"]
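To answer the API-level part of the question: one option (my own sketch, nothing built into PLY) is to have the top-level parse call return a small result object carrying both the parsed value and the accumulated error messages, so callers never need try/except for non-fatal problems:

```python
# Sketch of a non-fatal error-reporting API: the caller gets back a
# result object instead of an exception. All names here are my own.
class ParseResult:
    def __init__(self, value, errors):
        self.value = value          # parsed value, or None on hard failure
        self.errors = list(errors)  # accumulated error messages

    @property
    def ok(self):
        return not self.errors

def parse(source, parser):
    """Run a parser object that collects errors (like MyLexer above).
    parser.run() is a hypothetical entry point for illustration."""
    value = parser.run(source)
    return ParseResult(value, parser.report())
```

A caller can then check result.ok and iterate over result.errors, and the "exceptions are out" restriction is satisfied.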