Search code examples
pythonexpressiongroupingparentheses

How to capture grouping in an expression?


I'm trying to record grouping in expressions, like:

condition_a = Condition( "x = 5 ")
condition_b = Condition( "y > 6 " )
condition_c = Condition( "z = 7 " )

condition_c & ( condition_a | condition_b )

These are expressions a user can write to configure a SQL query, so the above would ultimately be expressed as x = 5 AND (y > 6 OR z = 7).

I tried this:

class Condition( object ) :
    def __init__( self, expr ) :
        ....

    def __and__( self , other ) :
        print 'calling and'
        ....
        return self

    def __or__( self , other ) : 
        print 'calling or'
        ....
        return self

    def __call__( self , other ) :
        print 'calling ()'
        ....
        return self

I would expect this code :

condition_a = Condition( "x = 5 ")
condition_b = Condition( "y > 6 " )
condition_c = Condition( "z = 7 " )
condition = condition_c & ( condition_a | condition_b )
expected =  "z = 7 AND ( x = 5 OR y > 6 )" 

to print the calling () but it does not. I just see :

calling or
calling and

It seems I am not using a call, so something like condition_a(1), so __call__ is not called.

Is there a way to achieve what I want? You can check the expected which is what I want to have (and and or works, it's just I don't have the parenthesis around part of the expression).


Solution

  • No, the (...) parentheses in your expression are not a call expression. They are just there to group your expressions.

    Python does not preserve grouping parentheses. They only are used by the parser to build an abstract syntax tree, and grouping influences in what order expressions are executed.

    The expression

    A & B | C
    

    is parsed into a tree of BinOp objects, like this:

          |
        /   \
       &      C
     /   \
    A     B
    

    & binds more tightly here, so A & B is executed before the | operator is executed. That's because the | operator has a lower precedence than &.

    Parentheses tell the parser to build separate trees for everything between (...) before putting that into the tree, so the expression

    A & (B | C)
    

    becomes

       &
     /   \
    A     |
        /   \
      B       C
    

    The parentheses are gone, all that is left is the tree of expressions, and Python knows know in what order you need to execute those expressions; the | is executed first, on B and C, and the result of this expression is then used in the & operator.

    You can look at those AST objects with the ast module:

    >>> import ast
    >>> print(ast.dump(ast.parse('A & B | C').body[0].value)))
    BinOp(
        left=BinOp(
            left=Name(id='A', ctx=Load()),
            op=BitAnd(),
            right=Name(id='B', ctx=Load())
        ),
        op=BitOr(),
        right=Name(id='C', ctx=Load()))
    )
    >>> print(ast.dump(ast.parse('A & (B | C)').body[0].value))
    BinOp(
        left=Name(id='A', ctx=Load()),
        op=BitAnd(),
        right=BinOp(
            left=Name(id='B', ctx=Load()),
            op=BitOr(),
            right=Name(id='C', ctx=Load())
        )
    )
    

    (I unwrapped the outer Module and Expr objects to focus on the BinOp trees, and manually indented the nested objects).

    and you can see the order of operations change when you look at the resulting bytecode with the dis module:

    >>> dis.dis(compile('A & B | C', '', 'eval'))
      1           0 LOAD_NAME                0 (A)
                  2 LOAD_NAME                1 (B)
                  4 BINARY_AND
                  6 LOAD_NAME                2 (C)
                  8 BINARY_OR
                 10 RETURN_VALUE
    >>> dis.dis(compile('A & (B | C)', '', 'eval'))
      1           0 LOAD_NAME                0 (A)
                  2 LOAD_NAME                1 (B)
                  4 LOAD_NAME                2 (C)
                  6 BINARY_OR
                  8 BINARY_AND
                 10 RETURN_VALUE
    

    (LOAD_NAME puts a value on the stack, and BINARY_OR and BINARY_AND take the top 2 values from the stack and push back the result).

    All this means that parentheses cannot be retrieved this way. They are gone, they were only there to inform the order of operations.

    What you need to do then, is maintain the syntax tree; you can still reconstruct that from the execution. You could just always include parentheses when you then output your result:

    class Expr(object):
        def __or__(self, other):
            return BinOp(self, '|', other)
    
        def __and__(self, other):
            return BinOp(self, '&', other)
    
    class BinOp(Expr):
        def __init__(self, left, oper, right):
            self.left, self.oper, self.right = left, oper, right
    
        def __repr__(self):
            return "({} {} {})".format(self.left, self.oper, self.right)
    
    class Condition(Expr):
        def __init__(self, expr):
            self.expr = expr
    
        def __repr__(self):
            return self.expr
    

    When printing a BinOp object, you'll always get parentheses:

    >>> condition_a = Condition( "x = 5 ")
    >>> condition_b = Condition( "y > 6 " )
    >>> condition_c = Condition( "z = 7 " )
    >>> condition_c & ( condition_a | condition_b )
    (z = 7  | (x = 5  | y > 6 ))
    >>> condition_c & condition_a | condition_b
    ((z = 7  | x = 5 ) | y > 6 )
    

    This is fine, because the SQL parser treats parentheses the same way; just to group expressions.

    The next step is to add precedence information to your binary operators. You can then decide if you have to have parentheses around a sub-expression or not; only when the left sub-expression has a lower precedence, or the right has a lower or equal precedence, would you need them:

    # Condition stays the same
    
    class Expr(object):
        precedence = float('inf')
    
        def __or__(self, other):
            return BinOp(self, '|', other)
    
        def __and__(self, other):
            return BinOp(self, '&', other)
    
    class BinOp(Expr):
        def __init__(self, left, oper, right):
            self.left, self.oper, self.right = left, oper, right
            self.precedence = '|&'.index(oper)  # 0 or 1
    
        def __repr__(self):
            left, right = self.left, self.right
            if left.precedence < self.precedence:
                left = '({})'.format(left)
            if right.precedence <= self.precedence:
                right = '({})'.format(right)
            return "{} {} {}".format(left, self.oper, right)
    

    Now parentheses only show up when the expression requires them:

    >>> condition_c & ( condition_a | condition_b )
    z = 7  & (x = 5  | y > 6 )
    >>> condition_c & condition_a | condition_b
    z = 7  & x = 5  | y > 6
    >>> condition_c & (condition_a & condition_b)
    z = 7  & (x = 5  & y > 6 )
    

    If you are going to extend this system (adding in more operator types), you'll have to expand on the concept of precedence there; the above example only has 0 and 1 for the binary operators, but you'll have to naturally expand on these values if you have more.

    I'd also strongly urge you to look at SQLAlchemy, which already implements all this, and much more.