python, pyparsing, infix-notation

pyparsing infixNotation optimization


My implementation of infixNotation is running slower than I would like even after using enablePackrat, which greatly increased performance.

Parsing needs to recognize and parse the following types of strings:

  • Basic arithmetic operations, numbers, negation, and parentheses groupings
  • Groupings in the format prefix::dotted.alphanum.string -> [prefix::dotted.alphanum.string]
  • Strings that look like function calls e.g. pow(some::var + 2.3, 5) -> [pow, [[some::var, +, 2.3], 5]]

The code I'm using:

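from pyparsing import (Combine, Literal, OneOrMore, Optional, Word, alphanums,
                       infixNotation, nums, oneOf, opAssoc)

def parse_expression(expr_str):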

    fraction = Combine("." + Word(nums))
    number = Combine(Word(nums) + Optional(fraction)).setParseAction(str_to_num)

    event_id_expr = Word(alphanums + "_") + "::"
    dotted_columns = Combine(Word(alphanums + "_") + Optional("."))

    column_expr = Combine(event_id_expr + OneOrMore(dotted_columns))

    arith_expr = infixNotation(column_expr | number, [
        (Word(alphanums + "_"), 1, opAssoc.RIGHT),
        ("-", 1, opAssoc.RIGHT),
        (oneOf("* /"), 2, opAssoc.LEFT),
        (oneOf("+ -"), 2, opAssoc.LEFT),
        (Literal(","), 2, opAssoc.LEFT)
    ])

    parsed_expr = arith_expr.parseString(expr_str).asList()[0]

    return parsed_expr

def str_to_num(t):
    num_str = t[0]
    try:
        return int(num_str)
    except ValueError:
        return float(num_str)

Are there any changes I can make that would result in substantial performance improvements? The structures I'm parsing are fairly simple, but they're in batches. On average each string is taking ~5.3ms.


Solution

  • It looks like you are "fudging" function calls by treating the function name as a unary operator and the comma as a binary operator. I think you are better off moving function calls into the operand expression for infixNotation:

    from pyparsing import (Combine, Forward, Group, OneOrMore, Optional,
                           Suppress, Word, alphanums, alphas, delimitedList,
                           infixNotation, oneOf, opAssoc, pyparsing_common)

    def parse_expression(expr_str):

        number = pyparsing_common.number()

        event_id_expr = Word(alphas+"_", alphanums + "_") + "::"
        dotted_columns = Combine(Word(alphas+"_", alphanums + "_") + Optional("."))

        column_expr = Combine(event_id_expr + OneOrMore(dotted_columns))

        func_name = Word(alphas+"_", alphanums+'_')
        LPAR, RPAR = map(Suppress, "()")
        # Forward declaration: function arguments are themselves arithmetic expressions
        arith_expr = Forward()
        func_call = Group(func_name('name')
                          + LPAR
                          + Group(Optional(delimitedList(arith_expr)))("args")
                          + RPAR)

        # Function calls are operands here, not operators
        arith_expr <<= infixNotation(number | func_call | column_expr, [
            ("-", 1, opAssoc.RIGHT),
            (oneOf("* /"), 2, opAssoc.LEFT),
            (oneOf("+ -"), 2, opAssoc.LEFT),
        ])

        parsed_expr = arith_expr.parseString(expr_str)[0]

        return parsed_expr

    I also modified most of your identifiers to use the two-argument form of Word (allowed initial characters, then allowed body characters). Just using Word(alphanums+"_") would also match ordinary integers, which I don't think is your intent. If I got this wrong, then just put these back as you had them.
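
As a quick check of the two points above, here is a minimal sketch that runs the same grammar against the function-call example from the question and compares the one- and two-argument forms of Word. One thing differs from the answer's code and is purely an assumption of this sketch: the grammar is built once at module level instead of inside parse_expression on every call. The names `loose`, `strict`, and `strict_rejects_digits` are hypothetical and exist only for illustration.

```python
from pyparsing import (
    Combine, Forward, Group, OneOrMore, Optional, ParseException, Suppress,
    Word, alphanums, alphas, delimitedList, infixNotation, oneOf, opAssoc,
    pyparsing_common,
)

# Assumption of this sketch: build the grammar once, at import time,
# rather than rebuilding it for every string that is parsed.
number = pyparsing_common.number()

event_id_expr = Word(alphas + "_", alphanums + "_") + "::"
dotted_columns = Combine(Word(alphas + "_", alphanums + "_") + Optional("."))
column_expr = Combine(event_id_expr + OneOrMore(dotted_columns))

func_name = Word(alphas + "_", alphanums + "_")
LPAR, RPAR = map(Suppress, "()")
arith_expr = Forward()
func_call = Group(
    func_name("name")
    + LPAR
    + Group(Optional(delimitedList(arith_expr)))("args")
    + RPAR
)

arith_expr <<= infixNotation(number | func_call | column_expr, [
    ("-", 1, opAssoc.RIGHT),
    (oneOf("* /"), 2, opAssoc.LEFT),
    (oneOf("+ -"), 2, opAssoc.LEFT),
])


def parse_expression(expr_str):
    """Parse one expression string with the prebuilt grammar."""
    return arith_expr.parseString(expr_str)[0]


# The function-call example from the question.
result = parse_expression("pow(some::var + 2.3, 5)")
print(result.asList())

# One-argument Word: digits are allowed as the first character,
# so a plain integer is accepted as an "identifier".
loose = Word(alphanums + "_")
print(loose.parseString("123").asList())

# Two-argument Word: the first character must be a letter or underscore,
# so a plain integer is rejected.
strict = Word(alphas + "_", alphanums + "_")
try:
    strict.parseString("123")
    strict_rejects_digits = False
except ParseException:
    strict_rejects_digits = True
print(strict_rejects_digits)
```

Whether hoisting the grammar out of the function makes a measurable difference depends on the batch, so it is worth timing both variants on real input before adopting it.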