Tags: python, parsing, pyparsing

How to parse an optional operator with pyparsing library?


I want to parse strings like alpha OR beta gamma, where a missing operator (in this case between beta and gamma) is treated as an implicit AND.

Here is the code I tried:

import pyparsing as pp


class Term:
    def __init__(self, tokens: pp.ParseResults):
        self.value = str(tokens[0])

    def __repr__(self) -> str:
        return f"Term({self.value})"


class BinaryOp:
    def __init__(self, tokens: pp.ParseResults) -> None:
        self.op = tokens[0][1]
        self.left = tokens[0][0]
        self.right = tokens[0][2]

    def __repr__(self) -> str:
        return f"BinaryOp({self.op}, {self.left}, {self.right})"


and_ = pp.Keyword("AND")
or_ = pp.Keyword("OR")
word = (~(and_ | or_) + pp.Word(pp.alphanums + pp.alphas8bit + "_")).set_parse_action(Term)

expression = pp.infix_notation(
    word,
    [
        (pp.Optional(and_), 2, pp.opAssoc.LEFT, BinaryOp),
        (or_, 2, pp.opAssoc.LEFT, BinaryOp),
    ],
)

input_string = "alpha OR beta gamma"
parsed_result = expression.parseString(input_string)
print(parsed_result.asList())

The output is [BinaryOp(OR, Term(alpha), Term(beta))], so the implicit AND and gamma are not parsed. How can I fix this?


Solution

  • This was a really interesting case that you turned up, thank you!

    First, the solution to your problem. To diagnose your parser's behavior, let's wrap each use of your BinaryOp class in trace_parse_action:

    expression = pp.infix_notation(
        word,
        [
            (pp.Optional(and_), 2, pp.opAssoc.LEFT, pp.trace_parse_action(BinaryOp)),
            (or_, 2, pp.opAssoc.LEFT, pp.trace_parse_action(BinaryOp)),
        ],
    )
    

    If we parse a known working string "alpha OR beta", we get this output:

    >>entering BinaryOp(line: 'alpha OR beta', 0, ParseResults([ParseResults([Term(alpha), 'OR', Term(beta)], {})], {}))
    <<leaving BinaryOp (ret: BinaryOp(OR, Term(alpha), Term(beta)))
    [BinaryOp(OR, Term(alpha), Term(beta))]
    

    trace_parse_action echoes the inbound and outbound values for a parse action.

    Now let's try a failing string "alpha beta":

    >>entering BinaryOp(line: 'alpha beta', 0, ParseResults([ParseResults([Term(alpha), Term(beta)], {})], {}))
    <<leaving BinaryOp (exception: list index out of range)
    [Term(alpha)]
    

    What list index? Ah! It's the list indexing you are doing in BinaryOp:

        self.op = tokens[0][1]
        self.left = tokens[0][0]
        self.right = tokens[0][2]
    

    The inbound ParseResults only contains 2 elements, but you are trying to process 3. You can fix this by having the Optional(and_) term supply a default "AND" value:

        (pp.Optional(and_, default="AND"), 2, pp.opAssoc.LEFT, pp.trace_parse_action(BinaryOp)),
    

    And now, even when no operator is given, you still get 3 tokens passed into the parse action:

    >>entering BinaryOp(line: 'alpha beta', 0, ParseResults([ParseResults([Term(alpha), 'AND', Term(beta)], {})], {}))
    <<leaving BinaryOp (ret: BinaryOp(AND, Term(alpha), Term(beta)))
    [BinaryOp(AND, Term(alpha), Term(beta))]
    

    So the simplest solution is to add the default "AND" value for your AND operator and you are good to go.
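
    Putting the fix together, here is a minimal end-to-end sketch of the corrected parser. It assumes pyparsing 3.x (for the snake_case method names) and reuses the Term and BinaryOp classes from the question, with the tuple unpacking discussed further down. The parse_all=True argument is an addition that is not in the original code: it makes unconsumed trailing input raise an exception instead of being dropped silently.

```python
import pyparsing as pp


class Term:
    def __init__(self, tokens: pp.ParseResults) -> None:
        self.value = str(tokens[0])

    def __repr__(self) -> str:
        return f"Term({self.value})"


class BinaryOp:
    def __init__(self, tokens: pp.ParseResults) -> None:
        # infix_notation passes a single nested group: [left, op, right].
        # Unpacking raises ValueError (not IndexError) if the shape is wrong.
        self.left, self.op, self.right = tokens[0]

    def __repr__(self) -> str:
        return f"BinaryOp({self.op}, {self.left}, {self.right})"


and_ = pp.Keyword("AND")
or_ = pp.Keyword("OR")
# The negative lookahead keeps the keywords from being parsed as plain words.
word = (~(and_ | or_) + pp.Word(pp.alphanums + pp.alphas8bit + "_")).set_parse_action(Term)

expression = pp.infix_notation(
    word,
    [
        # default="AND" makes the operator token appear even when it is omitted.
        (pp.Optional(and_, default="AND"), 2, pp.opAssoc.LEFT, BinaryOp),
        (or_, 2, pp.opAssoc.LEFT, BinaryOp),
    ],
)

result = expression.parse_string("alpha OR beta gamma", parse_all=True)
print(result.as_list())
```

    Because the AND level is listed first, it binds more tightly than OR, so beta and gamma are grouped together before the OR is applied.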

    Now for the interesting parts. Let's back out that fix and see what we got from pyparsing. Why did we get that exception, and why didn't pyparsing detect it? First off, the message we got from trace_parse_action could have been a little better. Instead of:

    <<leaving BinaryOp (exception: list index out of range)
    

    It would have been better to see:

    <<leaving BinaryOp (exception: IndexError: list index out of range)
    

    (and I am adding this as a bug to be fixed in the next pyparsing release).

    Suppose we change your code to do tuple unpacking instead of explicit indexing:

        self.left, self.op, self.right = tokens[0]
    

    Now pyparsing breaks in a more self-explanatory way (once I fix trace_parse_action to also emit the exception type):

    >>entering BinaryOp(line: 'alpha beta', 0, ParseResults([ParseResults([Term(alpha), Term(beta)], {})], {}))
    <<leaving BinaryOp (exception: ValueError: not enough values to unpack (expected 3, got 2))
    Traceback (most recent call last):
        ... traceback follows ...
    

    So pyparsing will fail if a parse action raises ValueError, but not if it raises IndexError - what's up with that?

    It turns out that pyparsing internally catches IndexError, on the assumption that it was raised because a parser ran off the end of the input string, and treats it like a ParseException. Handling exceptions from parse actions in this way is intentional: it lets developers write a parse action that does additional validation and raises a ParseException if the validation fails. But it also means that an IndexError gets "swallowed". In your case it wasn't something you raised intentionally, and it would have been nice if pyparsing had detected that. I'll see if I can make it a little smarter - thanks for posing this question!
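
    The difference between the two failure modes can be illustrated without pyparsing at all. The sketch below is a deliberately simplified stand-in, not pyparsing's actual code: ToyParseException, run_action and outcome are hypothetical names, and run_action only mimics the behavior described above, where an IndexError is converted into an ordinary parse failure while any other exception propagates.

```python
class ToyParseException(Exception):
    """Hypothetical stand-in for a parse failure (not pyparsing's class)."""


def run_action(action, tokens):
    # Mimics the described behavior: IndexError is treated as a failed
    # match, every other exception type is left to propagate.
    try:
        return action(tokens)
    except IndexError as exc:
        raise ToyParseException(f"treated as a failed match: {exc}") from exc


def by_index(tokens):
    group = tokens[0]
    return (group[1], group[0], group[2])  # IndexError when only 2 items


def by_unpacking(tokens):
    left, op, right = tokens[0]  # ValueError when only 2 items
    return (op, left, right)


# The shape the parse action received when the operator was omitted.
two_tokens = [["alpha", "beta"]]


def outcome(action):
    try:
        run_action(action, two_tokens)
    except ToyParseException:
        return "swallowed"   # looks like the grammar simply did not match
    except ValueError:
        return "propagated"  # the bug surfaces with a traceback
    return "ok"


print(outcome(by_index))      # swallowed
print(outcome(by_unpacking))  # propagated
```

    In the real parser, the "swallowed" case is why the output quietly stopped at Term(alpha) instead of showing a traceback.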