Search code examples
pythonpyparsing

parsing a complex logical expression in pyparsing in a binary tree fashion


I am trying to parse complex logical expression like the one below;

x > 7 AND x < 8 OR x = 4

and get the parsed string as a binary tree. For the above expression the expected parsed expression should look like

[['x', '>', 7], 'AND', [['x', '<', 8], 'OR', ['x', '=', 4]]]

'OR' logical operator has higher precedence than 'AND' operator. Parenthesis can override the default precedence. To be more general, the parsed expression should look like;

<left_expr> <logical_operator> <right_expr>

Another example would be

input_string = x > 7 AND x < 8 AND x = 4
parsed_expr  = [[['x', '>', 7], 'AND', ['x', ',', 8]], 'AND', ['x', '=', 4]]

So far i came up with this simple solution which sadly cannot generate parsed expression in binary tree fashion. operatorPrecedence doesn't seem to have help me here where there is same logical operator consecutively as in previous example.

import pyparsing as pp
complex_expr = pp.Forward()
operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
logical = (pp.Keyword("AND") | pp.Keyword("OR")).setName("logical")
vars = pp.Word(pp.alphas, pp.alphanums + "_") | pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
condition = (vars + operator + vars)
clause = pp.Group(condition ^ (pp.Suppress("(") + complex_expr + pp.Suppress(")") ))

expr = pp.operatorPrecedence(clause,[
                            ("OR", 2, pp.opAssoc.LEFT, ),
                            ("AND", 2, pp.opAssoc.LEFT, ),])

complex_expr << expr
print complex_expr.parseString("x > 7 AND x < 8 AND x = 4")

Any suggestions or guidance is well appreciated.

BNF for the expression (without parenthesis) could be

<expr>       -> <expr> | <expr> <logical> <expr>
<expr>       -> <opnd> <relational> <opnd>
<opnd>       -> <variable> | <numeric>
<relational> -> <'>'> | <'='> | <'>='> | <'<='> | <'!='>

Solution

  • NOTE: the operatorPrecedence method of pyparsing is deprecated in favor of the method name infixNotation.

    Try changing:

    expr = pp.operatorPrecedence(clause,[ 
                                ("OR", 2, pp.opAssoc.LEFT, ), 
                                ("AND", 2, pp.opAssoc.LEFT, ),]) 
    

    to:

    expr = pp.operatorPrecedence(condition,[ 
                                ("OR", 2, pp.opAssoc.LEFT, ), 
                                ("AND", 2, pp.opAssoc.LEFT, ),]) 
    

    The first argument to operatorPrecedence is the primitive operand to be used with the operators - there is no need to include your complexExpr in parentheses - operatorPrecedence will do that for you. Since your operand is actually another deeper comparison, you might consider changing:

    condition = (expr + operator + expr)
    

    to:

    condition = pp.Group(expr + operator + expr)
    

    so that the output of operatorPrecedence is easier to process. With these changes, parsing x > 7 AND x < 8 OR x = 4 gives:

    [[['x', '>', '7'], 'AND', [['x', '<', '8'], 'OR', ['x', '=', '4']]]]
    

    which recognizes OR's higher precedence and groups it first. (Are you sure you want this order of AND and OR precedence? I think the traditional ordering is the reverse, as shown in this wikipedia entry.)

    I think you are also asking why pyparsing and operatorPrecedence does not return the results in nested binary pairs, that is, you expect parsing "A and B and C" would return:

    [['A', 'and', 'B'] 'and', 'C']
    

    but what you get is:

    ['A', 'and', 'B', 'and', 'C']
    

    That is because operatorPrecedence parses repeated operations at the same precedence level using repetition, not recursion. See this question which is very similar to yours, and whose answer includes a parse action to convert your repetitive parse tree to the more traditional binary parse tree. You can also find a sample boolean expression parser implemented using operatorPrecedence on the pyparsing wiki page.

    EDIT: To clarify, this is what I recommend you reduce your parser to:

    import pyparsing as pp
    
    operator = pp.Regex(">=|<=|!=|>|<|=").setName("operator")
    number = pp.Regex(r"[+-]?\d+(:?\.\d*)?(:?[eE][+-]?\d+)?")
    identifier = pp.Word(pp.alphas, pp.alphanums + "_")
    comparison_term = identifier | number 
    condition = pp.Group(comparison_term + operator + comparison_term)
    
    expr = pp.operatorPrecedence(condition,[
                                ("AND", 2, pp.opAssoc.LEFT, ),
                                ("OR", 2, pp.opAssoc.LEFT, ),
                                ])
    
    print expr.parseString("x > 7 AND x < 8 OR x = 4")
    

    If support for NOT might also be something you want to add, then this would look like:

    expr = pp.operatorPrecedence(condition,[
                                ("NOT", 1, pp.opAssoc.RIGHT, ),
                                ("AND", 2, pp.opAssoc.LEFT, ),
                                ("OR", 2, pp.opAssoc.LEFT, ),
                                ])
    

    At some point, you may want to expand the definition of comparison_term with a more complete arithmetic expression, defined with its own operatorPrecedence definition. I would suggest doing it this way, rather than creating one monster opPrec definition, as you have already alluded to some of the performance downsides to opPrec. If you still get performance issues, look into ParserElement.enablePackrat.