In pyparsing, can I treat whitespace as a token when certain conditions are met when using infixNotation?


I'm trying to use pyparsing==2.4.7 to parse search queries that have a field:value format.

Examples of the strings I want to parse include:

field1:value1
field1:value1 field2:value2
field1:value1 AND field2:value2
(field1:value1a OR field1:value1b) field2:value2
(field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)

A few things to note:

  • OR and | both mean "OR"; likewise, AND and & both mean "AND"
  • If there is no boolean operator between conditions, then an AND is implied
  • Queries can be nested hierarchically with parentheses
  • The values (on the right side of the :) will never have spaces

I have written a parser that works (code is based on this SO answer), but only for when all of the operators are present (AND and OR):

import pyparsing as pp
from pyparsing import Word, alphanums, Literal, oneOf

field_name = Word(alphanums).setResultsName('field_name')

search_value = Word(alphanums + '-').setResultsName('search_value')

operator = Literal(':')

query = field_name + operator + search_value

AND = oneOf(['AND', 'and', '&', ' '])
OR = oneOf(['OR', 'or', '|'])
NOT = oneOf(['NOT', 'not', '!'])

query_expr = pp.infixNotation(query, [
    (NOT, 1, pp.opAssoc.RIGHT, ),
    (AND, 2, pp.opAssoc.LEFT, ),
    (OR, 2, pp.opAssoc.LEFT, ),
])

class ComparisonExpr:
    def __init__(self, tokens):
        self.tokens = tokens
    def __str__(self):
        return "Comparison:('field': {!r}, 'operator': {!r}, 'value': {!r})".format(*self.tokens)
    def __repr__(self):
        return self.__str__()

query.addParseAction(ComparisonExpr)

sample = "(field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)"

result = query_expr.parseString(sample).asList()

from pprint import pprint
pprint(result)

[[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'),
   '|',
   Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')],
  '&',
  [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'),
   '|',
   Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]]

However, if I try it with a sample that is missing an operator, the parser appears to stop at the point where an operator would be expected:

sample = "(field1:value1a | field1:value1b) (field2:value2a | field2:value2b)"

result = query_expr.parseString(sample).asList()
from pprint import pprint
pprint(result)

[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'),
  '|',
  Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')]]

Is there a way to make whitespace an "implicit AND" if there is no operator separating terms?


Solution

  • Short answer:

    Replace your definition of AND with:

    AND = oneOf(['AND', 'and', '&']) | pp.Empty()
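
A reduced sketch of why this works, assuming pyparsing's default behavior of skipping whitespace before each expression is tried: a literal ' ' alternative is only attempted after the space has already been skipped, whereas Empty() matches without consuming anything. The names term, with_space and with_empty are illustrative only and are not part of the question's grammar.

```python
import pyparsing as pp

term = pp.Word(pp.alphas)

# A literal " " operator, as in the question: whitespace is skipped before
# the operator is tried, so this alternative is attempted at the "b".
with_space = pp.infixNotation(term, [
    (pp.oneOf(["&", " "]), 2, pp.opAssoc.LEFT),
])

# Empty() matches without consuming input, so two adjacent operands are
# accepted as one binary expression with no operator token between them.
with_empty = pp.infixNotation(term, [
    (pp.Literal("&") | pp.Empty(), 2, pp.opAssoc.LEFT),
])

space_result = with_space.parseString("a b").asList()
implicit_result = with_empty.parseString("a b").asList()
explicit_result = with_empty.parseString("a & b").asList()

print(space_result)
print(implicit_result)
print(explicit_result)
```

Note that in this form the implicit case produces a group with no operator token in it, which is what the parse-action suggestion addresses.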
    

    Some other suggestions:

    For easier post-parse processing, you may want the Empty() to actually emit a "&" operator. You can do that with a parse action:

    AND = oneOf(['AND', 'and', '&']) | pp.Empty().addParseAction(lambda: "&")
    

    In fact, you can normalize all of your operators to just "&", "|", and "!", so that later code does not need any "if operator == 'AND' or operator == 'and' or ..." checks. Put the parse action on the whole expression:

    AND = (oneOf(['AND', 'and', '&']) | pp.Empty()).addParseAction(lambda: "&")
    OR = oneOf(['OR', 'or', '|']).addParseAction(lambda: "|")
    NOT = oneOf(['NOT', 'not', '!']).addParseAction(lambda: "!")
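
To illustrate what the normalization buys you, here is a minimal evaluator sketch over the nested-list shape shown in the question's output. It is an assumption-laden example rather than part of the parser: plain (field, ':', value) tuples stand in for the ComparisonExpr objects, and evaluate and record are hypothetical names.

```python
def evaluate(node, record):
    """Evaluate a parsed query tree against a dict of field values."""
    if isinstance(node, tuple):
        # Leaf: a (field, ':', value) comparison (stand-in for ComparisonExpr)
        field, _, value = node
        return record.get(field) == value
    if len(node) == 1:
        # Single-element wrapper, such as the outermost list from asList()
        return evaluate(node[0], record)
    if node[0] == "!":
        # Unary NOT: ['!', operand]
        return not evaluate(node[1], record)
    # Binary group: [operand, op, operand, op, operand, ...] with one repeated op
    op = node[1]
    results = [evaluate(operand, record) for operand in node[0::2]]
    return all(results) if op == "&" else any(results)


tree = [[[("field1", ":", "value1a"), "|", ("field1", ":", "value1b")],
         "&",
         [("field2", ":", "value2a"), "|", ("field2", ":", "value2b")]]]

match = evaluate(tree, {"field1": "value1b", "field2": "value2a"})
no_match = evaluate(tree, {"field1": "value1b", "field2": "other"})
print(match, no_match)
```

Because every operator arrives as one of three symbols, the evaluator needs a single comparison per operator instead of one per spelling.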
    

    Also, since you are now accepting "" as equivalent to "&", you should make pyparsing treat your operators as keywords, so that a word such as "oregon" is not read as "or" followed by "egon". Add the asKeyword argument to all your oneOf expressions:

    AND = (oneOf(['AND', 'and', '&'], asKeyword=True)
           | pp.Empty()).addParseAction(lambda: "&")
    OR = oneOf(['OR', 'or', '|'], asKeyword=True).addParseAction(lambda: "|")
    NOT = oneOf(['NOT', 'not', '!'],  asKeyword=True).addParseAction(lambda: "!")
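
Keyword-style matching mainly matters for the alphabetic spellings, and how word boundaries are applied to symbols such as "&" may differ between pyparsing versions. The following is a hedged, self-contained variation that applies pp.Keyword only to the words and pp.Literal to the symbols. The make_operator helper and the parse wrapper are hypothetical names, and pp.Group replaces the ComparisonExpr class purely to keep the sketch short.

```python
import pyparsing as pp

field_name = pp.Word(pp.alphanums)
search_value = pp.Word(pp.alphanums + "-")
query = pp.Group(field_name + pp.Literal(":") + search_value)


def make_operator(words, symbol):
    # Hypothetical helper: word spellings match as whole keywords only,
    # while the symbol matches as a plain literal.
    return pp.MatchFirst([pp.Keyword(w) for w in words] + [pp.Literal(symbol)])


# Every spelling, including "no operator at all" for AND, is normalized
# to a single symbol by the parse action.
AND = (make_operator(["AND", "and"], "&") | pp.Empty()).addParseAction(lambda: "&")
OR = make_operator(["OR", "or"], "|").addParseAction(lambda: "|")
NOT = make_operator(["NOT", "not"], "!").addParseAction(lambda: "!")

query_expr = pp.infixNotation(query, [
    (NOT, 1, pp.opAssoc.RIGHT),
    (AND, 2, pp.opAssoc.LEFT),
    (OR, 2, pp.opAssoc.LEFT),
])


def parse(text):
    # parseAll=True makes unparsed trailing input an error instead of
    # silently stopping, which is what hid the problem in the question.
    return query_expr.parseString(text, parseAll=True).asList()


explicit = parse("(field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)")
implicit = parse("(field1:value1a | field1:value1b) (field2:value2a | field2:value2b)")
print(explicit == implicit)
print(parse("f1:a android:x"))
```

The second print shows a field name that begins with "and" being kept whole instead of being split into an operator and a shorter field name.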
    

    Lastly, when you want to write test strings, you do not need to loop over them or catch ParseExceptions yourself; just use runTests:

    query_expr.runTests("""\
        (field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)
        (field1:value1a | field1:value1b) (field2:value2a | field2:value2b)
        """)
    

    This prints each test string, followed by either the parsed results or the parse exception with a '^' marking where the exception occurred:

    (field1:value1a | field1:value1b) & (field2:value2a | field2:value2b)
    [[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]]
    [0]:
      [[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]
      [0]:
        [Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')]
      [1]:
        &
      [2]:
        [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]
    
    (field1:value1a | field1:value1b) (field2:value2a | field2:value2b)
    [[[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]]
    [0]:
      [[Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')], '&', [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]]
      [0]:
        [Comparison:('field': 'field1', 'operator': ':', 'value': 'value1a'), '|', Comparison:('field': 'field1', 'operator': ':', 'value': 'value1b')]
      [1]:
        &
      [2]:
        [Comparison:('field': 'field2', 'operator': ':', 'value': 'value2a'), '|', Comparison:('field': 'field2', 'operator': ':', 'value': 'value2b')]