Suppose I have some Python code using rply that looks like this (taken from here):
from rply import ParserGenerator, LexerGenerator
from rply.token import BaseBox
lg = LexerGenerator()
# Add takes a rule name, and a regular expression that defines the rule.
lg.add("PLUS", r"\+")
lg.add("MINUS", r"-")
lg.add("NUMBER", r"\d+")
lg.ignore(r"\s+")
# This is a list of the token names. precedence is an optional list of
# tuples which specifies order of operation for avoiding ambiguity.
# precedence must be one of "left", "right", "nonassoc".
# cache_id is an optional string which specifies an ID to use for
# caching. It should *always* be safe to use caching,
# RPly will automatically detect when your grammar is
# changed and refresh the cache for you.
pg = ParserGenerator(["NUMBER", "PLUS", "MINUS"],
                     precedence=[("left", ["PLUS", "MINUS"])], cache_id="myparser")
@pg.production("main : expr")
def main(p):
    # p is a list, of each of the pieces on the right hand side of the
    # grammar rule
    return p[0]
@pg.production("expr : expr PLUS expr")
@pg.production("expr : expr MINUS expr")
def expr_op(p):
    lhs = p[0].getint()
    rhs = p[2].getint()
    if p[1].gettokentype() == "PLUS":
        return BoxInt(lhs + rhs)
    elif p[1].gettokentype() == "MINUS":
        return BoxInt(lhs - rhs)
    else:
        raise AssertionError("This is impossible, abort the time machine!")
@pg.production("expr : NUMBER")
def expr_num(p):
    return BoxInt(int(p[0].getstr()))
lexer = lg.build()
parser = pg.build()
class BoxInt(BaseBox):
    def __init__(self, value):
        self.value = value

    def getint(self):
        return self.value
This is simple code, so when you run this:
parser.parse(lexer.lex("1 + 3"))
It will execute and give you 4 as the output. This code works, but it still needs improvement. The part where @pg.production is invoked for addition and subtraction is not very efficient; by that I mean that if you were to add a few more operators, it would get very cramped. Is there a good way to write a less cramped version of that part, perhaps something like this:
@pg.production("expr : expr PLUS expr")
def plus(p):
    lhs = p[0].getint()
    rhs = p[2].getint()
    if p[1].gettokentype() == "PLUS":
        return BoxInt(lhs + rhs)
    else:
        raise AssertionError("This is impossible, abort the time machine!")
@pg.production("expr : expr MINUS expr")
def minus(p):
    lhs = p[0].getint()
    rhs = p[2].getint()
    if p[1].gettokentype() == "MINUS":
        return BoxInt(lhs - rhs)
    else:
        raise AssertionError("This is impossible, abort the time machine!")
If you split up the functions so that each production has its own function -- which is, indeed, best practice -- then there is absolutely no point in checking the token type of the operator. You already know what it is, because the parser will only call that function when the input matches its production.
So you can write reasonably compact code:
@pg.production("expr : expr PLUS expr")
def plus(p):
    return BoxInt(p[0].getint() + p[2].getint())
@pg.production("expr : expr MINUS expr")
def minus(p):
    return BoxInt(p[0].getint() - p[2].getint())
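Written this way, each additional operator stays cheap: one lexer rule, the token added to the ParserGenerator token list (and precedence, if needed), and one more two-line production function. As a rough sketch -- the TIMES token and its \* pattern are my own additions, not part of your example -- multiplication would look like this:

lg.add("TIMES", r"\*")  # new lexer rule

# also add "TIMES" to the ParserGenerator token list, and give it its own
# precedence level, e.g.
# precedence=[("left", ["PLUS", "MINUS"]), ("left", ["TIMES"])]

@pg.production("expr : expr TIMES expr")
def times(p):
    return BoxInt(p[0].getint() * p[2].getint())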