Search code examples
pythoncontext-free-grammarpyparsing

Parsing enumerations with CFG


I have a string that was generated from a rich text enumeration, for example:

"(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure"

I want to construct the original structure, for example:

{"Let X denote one of the following:" : {"weight":{}, "height":{}, "depth":{}} , 
"Y denote": {"color, except": {"white":{}, "blue":{}}}, "pressure":{} }

It is very apparent that this is context-free-grammar, however I had trouble implementing it pyparsing

Edit

I am no expert in CFG, so I hope this BNF representations is correct:

Assuming the following:

  1. w is the equivalent of any character (re.compile("\w*"))
  2. l is equivalent to re.compile("[a-z]")
  3. d is equivalent to `re.compile("\d+")
  4. r is equivalent to roman numerals (i,ii,iii ,...)

Then (hopefully), the BNF should look something like this

<E1>::= "(" <d> ")" | <E1> " "
<E2>::= "(" <l> ")" | <E2> " "
<E3>::= "(" <r> ")" | <E3> " "
<L0>::= <w> | <w> <E1> <L1> <L0>
<L1>::= <w> | <w> <E2> <L2> <L1>
<L2>::= <w> | <w> <E3> <L2>

Solution

  • Here is the first part of your parser using pyparsing expressions:

    import pyparsing as pp
    
    LPAR, RPAR = map(pp.Suppress, "()")
    COMMA, COLON = map(pp.Literal, ",:")
    wd = pp.Word(pp.alphas)
    letter = pp.oneOf(list(pp.alphas.lower())) 
    integer = pp.pyparsing_common.integer
    roman = pp.Word('ivx')
    
    e1 = LPAR + integer + RPAR
    e2 = LPAR + letter + RPAR
    e3 = LPAR + roman + RPAR
    

    The next part, based on your BNF, might look like:

    # predefine levels using Forwards, since they are recursive
    lev0 = pp.Forward()
    lev1 = pp.Forward()
    lev2 = pp.Forward()
    
    lev0 <<= wd | wd + e1 + lev1 + lev0
    lev1 <<= wd | wd + e2 + lev2 + lev1
    lev2 <<= wd | wd + e3 + lev2
    

    I assume that lev0 is supposed to parse your test string being the 0'th level of nesting.

    As I mentioned in the comments to your question, this fails immediately since your test string starts with "(1)" but your levels don't start with any of the e expressions.

    Before continuing, your BNF implements repetition in a classic BNF form:

    e ::= some expression
    list_of_e ::= e (list_of_e | empty)
    

    In pyparsing, you would implement this a little more directly:

    wd = pp.Word(pp.alphas)
    list_of_wd = pp.OneOrMore(wd)
    # or using tuple multiplication short-hand
    list_of_wd = wd * (1,)
    

    Looking at your example, I rewrote the levels of your BNF as:

    wds = pp.Group(wd*(1,))
    lev0 <<= e1 + wds + lev1*(0,)
    lev1 <<= e2 + wds + lev2*(0,)
    lev2 <<= e3 + wds
    expr = lev0()*(1,)
    expr.ignore(COMMA | COLON)
    

    (I didn't see the commas or colons helping in the parsing, so I just ignore them.)

    Using expr to parse your string:

    tests = """\
    (1) Y denote (a) color (b) pressure
    (1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
    """
    
    for test in tests.splitlines():
        print(test)
        expr.parseString(test).pprint()
        print()
    

    We get:

    (1) Y denote (a) color (b) pressure
    [1, ['Y', 'denote'], 'a', ['color'], 'b', ['pressure']]
    
    (1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
    [1,
     ['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
     'a',
     ['weight'],
     'b',
     ['height'],
     'c',
     ['depth'],
     2,
     ['Y', 'denote'],
     'a',
     ['color', 'except'],
     'i',
     ['white'],
     'ii',
     ['blue'],
     'b',
     ['pressure']]
    

    So it parsed, in the sense that it got through the whole input string, but all we've done is basic tokenizing, and have not represented any of the structure implied by your integer/alpha/roman nested lists.

    Pyparsing includes a grouping class to structure the results:

    G = pp.Group
    wds = G(wd*(1,))
    lev0 <<= G(e1 + G(wds + lev1*(0,)))
    lev1 <<= G(e2 + G(wds + lev2*(0,)))
    lev2 <<= G(e3 + wds)
    expr = lev0()*(1,)
    expr.ignore(COMMA | COLON)
    

    This gives output that better preserves your hierarchical structure:

    (1) Y denote (a) color (b) pressure
    [[1, [['Y', 'denote'], ['a', [['color']]], ['b', [['pressure']]]]]]
    
    (1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
    [[1,
      [['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
       ['a', [['weight']]],
       ['b', [['height']]],
       ['c', [['depth']]]]],
     [2,
      [['Y', 'denote'],
       ['a', [['color', 'except'], ['i', ['white']], ['ii', ['blue']]]],
       ['b', [['pressure']]]]]]
    

    A full parser would actually comprehend the concepts of "one of the following" vs. "all of the following", and inclusion and exclusion of elements, but that is beyond the scope of this question.