Parsing pseudo-algebraic string into command

I have a dictionary containing a list of object as

objects = {'A1': obj_1,
    'A2': obj_2,
    }

I then have a string as

cmd = '(1.3A1 + 2(A2 + 0.7A3)) or 2(A4 to A6)'

I want to translate this to a command as

max( 1.3*objects['A1'] + 2*(objects['A2'] + 0.73*objects['A3']), 2*max(objects['A4'], objects['A5'], objects['A6']))

My try

As I didn't find any better option, I started writing a parser from scratches.

PERSONAL NOTE: I don't think that attaching a 150-line code to a SO question is good practice as this will imply that the reader should read and understand it which is a demanding task. Nonetheless my previous question was downvoted because I didn't put my solution. So here you are...

import re
from more_itertools import stagger

def comb_to_py(string, objects):

    # Split the line
    toks = split_comb_string(string)

    # Escape for empty string
    if toks[0] == 'none':
        return []

    # initialize iterator
    # I could use a deque here. Let's see what works the best
    iterator = stagger(toks, offsets=range(2), longest=True)

    return comb_it_to_py(iterator, objects)


def split_comb_string(string):

    # Add whitespaces between tokes when they could be implicit to allow string
    # splitting i.e. before/after plus (+), minus and closed bracket
    string = re.sub(r' ?([\+\-)]) ?', r' \1 ', string)

    # remove double spaces
    string = re.sub(' +', ' ', string)

    # Avoid situations as 'A1 + - 2A2' and replace them with 'A1 - 2A2'
    string = re.sub(r'\+ *\-', r'-', string)
    # Avoid situations as 'A1 - - 2A2' and replace them with 'A1 + 2A2'
    string = re.sub(r'\- *\-', r'+', string)

    # Add whitespace after "(" (we do not want to add it in front of it)
    string = re.sub(r'\( ?', r'( ', string)

    return string.strip().split(' ')


def comb_it_to_py(iterator, objects):

    for items in iterator:

        # item[0] is a case token (e.g. 1.2A3)
        # This should occur only with the first element
        if re.fullmatch(r'([\d.]*)([a-zA-Z(]+\d*)', items[0]) is not None:
            res = parse_case(items[0], objects, iterator)


        elif items[0] == ')' or items[0] is None:
            return res


        # plus (+)
        elif items[0] == '+':
            # skip one position
            skip_next(iterator)

            # add following item
            res += parse_case(items[1], objects, iterator)


        # minus (-)
        elif items[0] == '-':
            # skip one position
            skip_next(iterator)

            # add following item
            res -= parse_case(items[1], objects, iterator)

        else:
            raise(ValueError(f'Invalid or misplaced token {items[0]}'))

    return res

def parse_case(tok, objects, iterator):
    # Translate a case string into an object.
    # It handles also brackets as "cases" calling comb_it_to_py recursively
    res = re.match(r'([\d.]*)(\S*)', tok)

    if res[1] == '':
        mult = 1
    else:
        mult = float(res[1])

    if res[2] == '(':
        return mult * comb_it_to_py(iterator, objects)
    else:
        return mult * objects[res[2]]


def skip_next(iterator):
    try:
        next(iterator)
    except StopIteration:
        pass


if __name__ == '__main__':

    from numpy import isclose
    def test(string, expected_result):
        try:
            res = comb_to_py(string, objects)
        except Exception as e:
            print(f"Error during test on '{string}'")
            raise e

        assert isclose(res.value, expected_result), f"Failed test on '{string}'"


    objects = {'A1': 1, 'A2':2, 'A10':3}

    test('A2', 2)
    test('1.3A2', 2.6)

    test('1.3A2 + 3A1', 5.6)
    test('1.3A2+ 3A1', 5.6)
    test('1.3A2 +3A1', 5.6)
    test('1.3A2+3A1', 5.6)

    test('1.3A2 - 3A1', -0.4)
    test('1.3A2 -3A1', -0.4)
    test('1.3A2- 3A1', -0.4)
    test('1.3A2-3A1', -0.4)

    test('1.3A2 + -3A1', -0.4)
    test('1.3A2 +-3A1', -0.4)
    test('1.3A2 - -3A1', 5.6)

    test('A1 + 2(A2+A10)', 25)
    test('A1 - 2(A2+A10)', -23)

    test('2(A2+A10) + A1', 25)
    test('2(A2+A10) - A1', 23)
    test('2(A2+A10) - -A1', 25)
    test('2(A2+A10) - -2A1', 26)

This code is not only lengthy, but also very easy to break. The whole code is based on the correct split of the string and the regex section is there only to be sure that the string is split correctly, which totally depend on the position of the whitespaces inside the string, even if - in this specific syntax - most whitespaces should not be parsed at all.

Moreover, this code still doesn't handle the or keyword (where A or B should translate into max(A,B) and the to keyword (where A1 to A9 should translate in max([Ai for Ai in range(A1, A9)])).

Question

Is this the best approach or is there a more robust way for this type of tasks?

Note

I gave a look to pyparsing. It looks as a possibility, but, if I understood well, it should be used as a more robust "line-splitting", while the tokens would still have to be translated to an operation one by one manually. Is this correct?

Solution

Regular expressions are inherently unsuitable for a task involving parentheses for nested grouping – your pseudo-algebraic language (PAL) is not a regular language. An actual parser such as PyParsing (a PEG parser) should be used instead.

While this still requires translating from source code to operations, this can be performed directly during parsing.

We need a few language elements that directly translate to Python primitives:

Number literals, such as 1.3, as int/float literals or fractions.Fraction.
Name references, such as A3, as keys to the objects namespace.
Parentheses, such as (...), as grouping via parentheses for:
- Variants, such as (1.3 or A3), as max calls.
- Name ranges, such as A4 to A6, as max calls
- The + binary operator, as the + binary operator.
Implicit multiplication, such as 2(...), as 2 * (...).

Such a simple language is equally suitable for a transpiler or interpreter – there are no side-effects or introspection, so a naive translation without first-class objects, intermediate representation or AST is fine.

For a transpiler, we need to transform from PAL source code to Python source code. We can use pyparsing to directly read PAL and use a parse action to emit Python.

Primitive Expressions

The simplest case are numbers – both PAL and Python source are identical. This is ideal to look at the general structure of transpiling:

import pyparsing as pp

# PAL grammar rule: one "word" of sign, digits, dot, digits
NUMBER = pp.Regex(r"-?\d+\.?\d*")

# PAL -> Python transformation: Compute appropriate Python code
@NUMBER.setParseAction
def translate(result: pp.ParseResults) -> str:
    return result[0]

Note that setParseAction is commonly used with a lambda, instead of decorating a def. The longer variant is easier to comment/annotate, however.

A name reference is similar to parse, but needs some minor translation to Python. We can still use regular expressions, since there is no nesting here either. All names will be keys to a single, global namespace that we arbitrarily call objects.

NAME = pp.Regex(r"\w+\d+")

@NAME.setParseAction
def translate(result: pp.ParseResults) -> str:
    return f'objects["{result[0]}"]'   # interpolate key into namespace

Both grammar parts work independently for transpiling already. For example, NAME.parseString("A3") provides the source code objects["A3"].

Compound Expressions

Unlike terminal/primitive grammar expressions, compound expressions must refer to other expressions, possibly themselves (at this point, regular expressions fail). PyParsing makes this simple with Forward expressions – these are placeholders which are defined later on.

# placeholder for any valid PAL grammar element
EXPRESSION = pp.Forward()

Without operator precedence and just grouping via (...), all of +, or and to work similar. We pick or as a demonstrator.

The grammar gets more complicated now: We use pp.Suppress to match but discard the purely syntactical (/) and or. We use +/- to combine several grammar expressions (- means there are no alternatives when parsing). Finally, we use the forward reference EXPRESSION to refer to every other and this expression.

SOME_OR = pp.Suppress("(") + EXPRESSION + pp.OneOrMore(pp.Suppress("or") - EXPRESSION) - pp.Suppress(")")

@SOME_OR.setParseAction
def translate(result: pp.ParseResults) -> str:
    elements = ', '.join(result)
    return f"max({elements})"

Name ranges and addition work fundamentally the same, only the delimiter and output formatting changes. Implicit multiplication is simpler in that it works only on a pair of expressions.

At this point, we have a transpiler for each kind of language element. The missing rules can be created with the same approach. Now, we need to actually read source code and run the transpiled code.

We start by putting together the pieces that we have: inserting all grammar elements into the forward reference. We also provide a convenience function to abstract away PyParsing.

EXPRESSION << (NAME | NUMBER | SOME_OR)

def transpile(pal: str) -> str:
    """Transpile PAL source code to Python source code"""
    return EXPRESSION.parseString(pal, parseAll=True)[0]

In order to run some code, we need to transpile the PAL code and evaluate the Python code with some namespace. Since our grammar only allows safe input, we can use eval directly:

def execute(pal, **objects):
    """Execute PAL source code given some object values"""
    code = transpile(pal)
    return eval(code, {"objects": objects})

This function can be run with a given PAL source and name values to evaluate the equivalent Python value:

>>> execute("(A4 or A3 or 13)", A3=42, A4=7)
42

For complete support of PAL, define the missing compound rules and add them alongside the others to EXPRESSION.