Search code examples
pyparsing

Parsing nested SQL queries with parenthesized predicates using pyparsing


I'm trying to parse nested queries of the form that contains predicates with parentheses. Example:

query = '(A LIKE "%.something.com" AND B = 4) OR (C In ("a", "b") AND D Contains "asdf")'

I've tried many of the answers/examples I've seen but without getting them to work, and this is what I have come up with so far that almost(?) works:

from pyparsing import *

r = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'*+,-./:;<=>?@[\]^_`{|}~'

any_keyword = CaselessKeyword("AND") | CaselessKeyword("OR")

non_keyword = ~any_keyword + Word(r)

expr = infixNotation(originalTextFor(non_keyword[1, ...]),
                     [
                         (oneOf("AND OR", caseless=True, asKeyword=True), 2, opAssoc.LEFT)
                     ])

Then, running expr.parseString(query).asList() returns only:

[['A LIKE "%.something.com"', 'AND', 'B = 4']]

without the rest of the query. As far as I understand, this is due to the C In ("a", "b") part, since there are parentheses there.

Is there a way to "disregard" parentheses inside the predicates so that parsing returns the expected answer:

[[['A LIKE "%.something.com"', 'AND', 'B = 4'], 'OR', ['C In ("a", "b")', 'AND', 'D Contains "asdf"']]]

Solution

  • Welcome to pyparsing! You've made some good headway from following other examples, but let's back up just a bit.

    infixNotation is definitely the right way to go here, but there are many more operators in this expression than just AND and OR. These are all sub-expressions in your input string:

    A LIKE "%.something.com"
    B = 4
    C in ("a", "b")
    D contains "asdf"
    

    Each of these is its own binary expression, with operators "LIKE", "=", "in" and "contains". Also, your operands can be not only identifiers, but quoted strings, a collection of quoted strings, and numbers.

    I like your intuition that this is a logical expression of terms, so let's define 2 levels of infixNotations:

    1. an expression of column or numeric "arithmetic" (using '=', "LIKE", etc.)
    2. an expression of terms defined by #1, combined with logical NOT, AND, and OR operators

    If we call #1 a column_expr, #2 would look similar to what you have already written:

    expr = infixNotation(column_expr,
                         [
                             (NOT, 1, opAssoc.RIGHT),
                             (AND, 2, opAssoc.LEFT),
                             (OR, 2, opAssoc.LEFT),
                         ])
    

    I've added NOT as a reasonable extension to what you have now - most logical expressions include these 3 operators. Also, it is conventional to define them in this order of precedence, so that "X OR Y AND Z" eventually evaluates in the order "X OR (Y AND Z)", since AND has higher precedence than OR (and NOT is higher still).

    #1 takes some more work, so I've written a little BNF for what we should expect for an individual operand for column_expr (I cannot recommend taking this step highly enough!):

    identifier ::= one or more Words composed of printable letters (we may come back to this)
    number ::= an integer or real number (we can use the one defined in pyparsing_common.number)
    quotedString ::= (a quoted string like one define by pyparsing)
    quotedString_list ::= '(' quotedString [',' quotedString]... ')'
    
    # put identifier last, since it will match just about anything, and we want to try the other
    # expressions first
    column_operand ::= quotedString | quotedString_list | number | identifier
    

    Then a column_expr will be an infixNotation using these column_operands:

    ppc = pyparsing_common
    LPAR, RPAR = map(Suppress, "()")
    # use Group to keep these all together
    quotedString_list = Group(LPAR + delimitedList(quotedString) + RPAR)
    
    column_operand = quotedString | quotedString_list | ppc.number | identifier
    column_expr = infixNotation(column_operand,
                                [
                                    (IN | CONTAINS, 2, opAssoc.LEFT),
                                    (LIKE, 2, opAssoc.LEFT),
                                    ('=', 2, opAssoc.LEFT),
                                ])
    

    If you find you have to add other operators like "!=", most likely you will add them in to column_expr.

    Some other notes:

    • You probably want to remove ' and " from r, since they should really be handled as part of the quoted strings

    • As your list of keywords grows, you will find it easier to define them using something like this:

        AND, OR, NOT, LIKE, IN, CONTAINS = keyword_exprs = list(map(CaselessKeyword, """
            AND OR NOT LIKE IN CONTAINS
            """.split()))
        any_keyword = MatchFirst(keyword_exprs)
      

      Then you can reference them more easily as I have done in the code above.

    • Write small tests first before trying to test the complex query you posted. Nice work in including many of the variations of operands. Then use runTests to run them all, like this:

        expr.runTests("""
                    A LIKE "%.something.com"
                    B = 4
                    C in ("A", "B")
                    D CONTAINS "ASDF"
                    (A LIKE "%.something.com" AND B = 4) OR (C In ("a", "b") AND D Contains "asdf")
                    """)
      

    With these changes, I get this for your original query string:

    [[['A', 'LIKE', '"%.something.com"'], 'AND', 'B = 4'], 'OR', [['C', 'IN', ['"a"', '"b"']], 'AND', ['D', 'CONTAINS', '"asdf"']]]
    

    Hmmm, I'm not keen on a term that looks like 'B = 4', now that we are actually parsing the sub expressions. I suspect it is because your definition of identifier is a little too aggressive. If we cut it back to just ~any_keyword + Word(alphas, r), forcing a leading alpha character and without the [1, ...] for repetition, then we get the better-looking:

    [[['A', 'LIKE', '"%.something.com"'], 'AND', ['B', '=', 4]], 'OR', [['C', 'IN', ['"a"', '"b"']], 'AND', ['D', 'CONTAINS', '"asdf"']]]
    

    If in fact you do want these sub-expressions to be retained as they were found in the original, and just break up on the logical operators, then you can just wrap column_expr in originalTextFor as you used before, giving:

    [['A LIKE "%.something.com"', 'AND', 'B = 4'], 'OR', ['C In ("a", "b")', 'AND', 'D Contains "asdf"']]
    

    Good luck with your SQL parsing project!