I have a dictionary containing a list of object as
objects = {'A1': obj_1,
'A2': obj_2,
}
I then have a string as
cmd = '(1.3A1 + 2(A2 + 0.7A3)) or 2(A4 to A6)'
I want to translate this to a command as
max( 1.3*objects['A1'] + 2*(objects['A2'] + 0.73*objects['A3']), 2*max(objects['A4'], objects['A5'], objects['A6']))
As I didn't find any better option, I started writing a parser from scratches.
PERSONAL NOTE: I don't think that attaching a 150-line code to a SO question is good practice as this will imply that the reader should read and understand it which is a demanding task. Nonetheless my previous question was downvoted because I didn't put my solution. So here you are...
import re
from more_itertools import stagger
def comb_to_py(string, objects):
# Split the line
toks = split_comb_string(string)
# Escape for empty string
if toks[0] == 'none':
return []
# initialize iterator
# I could use a deque here. Let's see what works the best
iterator = stagger(toks, offsets=range(2), longest=True)
return comb_it_to_py(iterator, objects)
def split_comb_string(string):
# Add whitespaces between tokes when they could be implicit to allow string
# splitting i.e. before/after plus (+), minus and closed bracket
string = re.sub(r' ?([\+\-)]) ?', r' \1 ', string)
# remove double spaces
string = re.sub(' +', ' ', string)
# Avoid situations as 'A1 + - 2A2' and replace them with 'A1 - 2A2'
string = re.sub(r'\+ *\-', r'-', string)
# Avoid situations as 'A1 - - 2A2' and replace them with 'A1 + 2A2'
string = re.sub(r'\- *\-', r'+', string)
# Add whitespace after "(" (we do not want to add it in front of it)
string = re.sub(r'\( ?', r'( ', string)
return string.strip().split(' ')
def comb_it_to_py(iterator, objects):
for items in iterator:
# item[0] is a case token (e.g. 1.2A3)
# This should occur only with the first element
if re.fullmatch(r'([\d.]*)([a-zA-Z(]+\d*)', items[0]) is not None:
res = parse_case(items[0], objects, iterator)
elif items[0] == ')' or items[0] is None:
return res
# plus (+)
elif items[0] == '+':
# skip one position
skip_next(iterator)
# add following item
res += parse_case(items[1], objects, iterator)
# minus (-)
elif items[0] == '-':
# skip one position
skip_next(iterator)
# add following item
res -= parse_case(items[1], objects, iterator)
else:
raise(ValueError(f'Invalid or misplaced token {items[0]}'))
return res
def parse_case(tok, objects, iterator):
# Translate a case string into an object.
# It handles also brackets as "cases" calling comb_it_to_py recursively
res = re.match(r'([\d.]*)(\S*)', tok)
if res[1] == '':
mult = 1
else:
mult = float(res[1])
if res[2] == '(':
return mult * comb_it_to_py(iterator, objects)
else:
return mult * objects[res[2]]
def skip_next(iterator):
try:
next(iterator)
except StopIteration:
pass
if __name__ == '__main__':
from numpy import isclose
def test(string, expected_result):
try:
res = comb_to_py(string, objects)
except Exception as e:
print(f"Error during test on '{string}'")
raise e
assert isclose(res.value, expected_result), f"Failed test on '{string}'"
objects = {'A1': 1, 'A2':2, 'A10':3}
test('A2', 2)
test('1.3A2', 2.6)
test('1.3A2 + 3A1', 5.6)
test('1.3A2+ 3A1', 5.6)
test('1.3A2 +3A1', 5.6)
test('1.3A2+3A1', 5.6)
test('1.3A2 - 3A1', -0.4)
test('1.3A2 -3A1', -0.4)
test('1.3A2- 3A1', -0.4)
test('1.3A2-3A1', -0.4)
test('1.3A2 + -3A1', -0.4)
test('1.3A2 +-3A1', -0.4)
test('1.3A2 - -3A1', 5.6)
test('A1 + 2(A2+A10)', 25)
test('A1 - 2(A2+A10)', -23)
test('2(A2+A10) + A1', 25)
test('2(A2+A10) - A1', 23)
test('2(A2+A10) - -A1', 25)
test('2(A2+A10) - -2A1', 26)
This code is not only lengthy, but also very easy to break. The whole code is based on the correct split of the string and the regex section is there only to be sure that the string is split correctly, which totally depend on the position of the whitespaces inside the string, even if - in this specific syntax - most whitespaces should not be parsed at all.
Moreover, this code still doesn't handle the or
keyword (where A or B
should translate into max(A,B)
and the to
keyword (where A1 to A9
should translate in max([Ai for Ai in range(A1, A9)])
).
Is this the best approach or is there a more robust way for this type of tasks?
I gave a look to pyparsing. It looks as a possibility, but, if I understood well, it should be used as a more robust "line-splitting", while the tokens would still have to be translated to an operation one by one manually. Is this correct?
Regular expressions are inherently unsuitable for a task involving parentheses for nested grouping – your pseudo-algebraic language (PAL) is not a regular language. An actual parser such as PyParsing (a PEG parser) should be used instead.
While this still requires translating from source code to operations, this can be performed directly during parsing.
We need a few language elements that directly translate to Python primitives:
1.3
, as int
/float
literals or fractions.Fraction
.A3
, as keys to the objects
namespace.(...)
, as grouping via parentheses for:
(1.3 or A3)
, as max
calls.A4 to A6
, as max
calls+
binary operator, as the +
binary operator.2(...)
, as 2 * (...)
.Such a simple language is equally suitable for a transpiler or interpreter – there are no side-effects or introspection, so a naive translation without first-class objects, intermediate representation or AST is fine.
For a transpiler, we need to transform from PAL source code to Python source code. We can use pyparsing
to directly read PAL and use a parse action to emit Python.
The simplest case are numbers – both PAL and Python source are identical. This is ideal to look at the general structure of transpiling:
import pyparsing as pp
# PAL grammar rule: one "word" of sign, digits, dot, digits
NUMBER = pp.Regex(r"-?\d+\.?\d*")
# PAL -> Python transformation: Compute appropriate Python code
@NUMBER.setParseAction
def translate(result: pp.ParseResults) -> str:
return result[0]
Note that setParseAction
is commonly used with a lambda
, instead of decorating a def
. The longer variant is easier to comment/annotate, however.
A name reference is similar to parse, but needs some minor translation to Python. We can still use regular expressions, since there is no nesting here either. All names will be keys to a single, global namespace that we arbitrarily call objects
.
NAME = pp.Regex(r"\w+\d+")
@NAME.setParseAction
def translate(result: pp.ParseResults) -> str:
return f'objects["{result[0]}"]' # interpolate key into namespace
Both grammar parts work independently for transpiling already. For example, NAME.parseString("A3")
provides the source code objects["A3"]
.
Unlike terminal/primitive grammar expressions, compound expressions must refer to other expressions, possibly themselves (at this point, regular expressions fail). PyParsing makes this simple with Forward
expressions – these are placeholders which are defined later on.
# placeholder for any valid PAL grammar element
EXPRESSION = pp.Forward()
Without operator precedence and just grouping via (...)
, all of +
, or
and to
work similar. We pick or
as a demonstrator.
The grammar gets more complicated now: We use pp.Suppress
to match but discard the purely syntactical (
/)
and or
. We use +
/-
to combine several grammar expressions (-
means there are no alternatives when parsing). Finally, we use the forward reference EXPRESSION
to refer to every other and this expression.
SOME_OR = pp.Suppress("(") + EXPRESSION + pp.OneOrMore(pp.Suppress("or") - EXPRESSION) - pp.Suppress(")")
@SOME_OR.setParseAction
def translate(result: pp.ParseResults) -> str:
elements = ', '.join(result)
return f"max({elements})"
Name ranges and addition work fundamentally the same, only the delimiter and output formatting changes. Implicit multiplication is simpler in that it works only on a pair of expressions.
At this point, we have a transpiler for each kind of language element. The missing rules can be created with the same approach. Now, we need to actually read source code and run the transpiled code.
We start by putting together the pieces that we have: inserting all grammar elements into the forward reference. We also provide a convenience function to abstract away PyParsing.
EXPRESSION << (NAME | NUMBER | SOME_OR)
def transpile(pal: str) -> str:
"""Transpile PAL source code to Python source code"""
return EXPRESSION.parseString(pal, parseAll=True)[0]
In order to run some code, we need to transpile the PAL code and evaluate the Python code with some namespace. Since our grammar only allows safe input, we can use eval
directly:
def execute(pal, **objects):
"""Execute PAL source code given some object values"""
code = transpile(pal)
return eval(code, {"objects": objects})
This function can be run with a given PAL source and name values to evaluate the equivalent Python value:
>>> execute("(A4 or A3 or 13)", A3=42, A4=7)
42
For complete support of PAL, define the missing compound rules and add them alongside the others to EXPRESSION
.