Search code examples
pythoncontext-free-grammarpyparsing

pyparsing variables that refer to each other


I am using pyparsing to write a small grammar, but I'm running into an issue where the definition of one variable requires me to use another variable that itself requires the first one, etc. This answer helped me some, since I am working with Prolog strings, but it didn't seem to address the issue I'm having.

Working example, without the issue

import pyparsing as pp

# grammar
argument = pp.Word(pp.alphas)
pred = pp.Word(pp.alphas) + (pp.Suppress('(') + pp.delimitedList(argument) + pp.Suppress(')'))


# tests
test_string1="func(x, y)"

test_result1 = pred.parse_string(test_string1)
print(test_result1.dump())

Broken example, with the issue

import pyparsing as pp

# grammar
argument = pp.Word(pp.alphas) | pred
pred = pp.Word(pp.alphas) + (pp.Suppress('(') + pp.delimitedList(argument) + pp.Suppress(')'))


# tests
test_string2="func(func(x), y)"

test_result2 = pred.parse_string(test_string2)
print(test_result2.dump())

The idea is that a predication might have as one of its arguments another predication which itself might have more arguments or predications etc. However, in the first line of the grammar in the broken example, the code breaks because it is trying to reference pred, which hasn't been assigned yet. I can switch the order, but the problem will persist, because pred requires the usage of argument since a pred can have an argument.

On paper I think it should work, since eventually it would terminate, but because of the fact that I have to declare things in a certain order it doesn't seem to work in code form.


Solution

  • This is a common issue with grammars, where there is some recursion in the definition of things.

    I often encourage parser developers to take a moment and write up a BNF before diving into writing code. This is good practice whether you are using pyparsing or any other parsing library. Here is one that I came up with, based on your code:

    BNF:
        pred ::= identifier '(' args ')'
        args ::= argument [, argument]...
        argument ::= identifier | pred
        identifier ::= word of one or more alphas
    

    (Jumping ahead, pyparsing will create a railroad diagram of the BNF for you.) enter image description here

    Now you can do a mental walkthrough of your test string using this BNF. Looking at your string "func(func(x), y)", I have some questions about this BNF:

    1. Do preds always have arguments? (Should the contained args be optional?)
    2. Defining argument as identifier | pred may cause some ambiguity when parsing with pyparsing, since pred also starts with an identifier. We will either need to use '^' operator for pyparsing (which does a longest match), or reorder to match pred | identifier so that the more complex expression is tried before the simpler one.

    You rightly determine that there is recursion in this grammar. Pyparsing lets you define a "I need to use this now but I'll define it later" term using the Forward class (from the "forward" declaration in Pascal).

    pred = pp.Forward()
    

    Then translating this to pyparsing (working bottom-up) looks like:

    identifier = pp.Word(pp.alphas)
    argument = pred | identifier
    args = pp.delimitedList(argument)
    

    And then to insert the definition into the existing pred term, use the <<= operator instead of =:

    pred <<= identifier + pp.Suppress("(") + args + pp.Suppress(")")
    

    I got so tired of writing test loops for test inputs to parsers that I wrote the run_tests() method, which you would call like this:

    pred.run_tests("""\
        func(func(x), y)
        """)
    

    Which echoes the input string and then dumps the output:

    func(func(x), y)
    ['func', 'func', 'x', 'y']
    

    So a couple more notes:

    1. To make the arguments optional, wrap in a pp.Opt()

    2. To preserve structure, wrap logical groups of terms in pp.Group. I would suggest wrapping both pred and the '('+args+')' part of pred:

       pred <<= pp.Group(identifier + pp.Group(pp.Suppress("(") + args + pp.Suppress(")")))
      

    Which will then give this run_tests output:

    func(func(x), y)
    [['func', [['func', ['x']], 'y']]]
    [0]:
      ['func', [['func', ['x']], 'y']]
      [0]:
        func
      [1]:
        [['func', ['x']], 'y']
        [0]:
          ['func', ['x']]
          [0]:
            func
          [1]:
            ['x']
        [1]:
          y