Parsing enumerations with CFG

I have a string that was generated from a rich text enumeration, for example:

"(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure"

I want to construct the original structure, for example:

{"Let X denote one of the following:" : {"weight":{}, "height":{}, "depth":{}} , 
"Y denote": {"color, except": {"white":{}, "blue":{}}}, "pressure":{} }

It is very apparent that this is context-free-grammar, however I had trouble implementing it pyparsing

Edit

I am no expert in CFG, so I hope this BNF representations is correct:

Assuming the following:

w is the equivalent of any character (re.compile("\w*"))
l is equivalent to re.compile("[a-z]")
d is equivalent to `re.compile("\d+")
r is equivalent to roman numerals (i,ii,iii ,...)

Then (hopefully), the BNF should look something like this

<E1>::= "(" <d> ")" | <E1> " "
<E2>::= "(" <l> ")" | <E2> " "
<E3>::= "(" <r> ")" | <E3> " "
<L0>::= <w> | <w> <E1> <L1> <L0>
<L1>::= <w> | <w> <E2> <L2> <L1>
<L2>::= <w> | <w> <E3> <L2>

Solution

Here is the first part of your parser using pyparsing expressions:

import pyparsing as pp

LPAR, RPAR = map(pp.Suppress, "()")
COMMA, COLON = map(pp.Literal, ",:")
wd = pp.Word(pp.alphas)
letter = pp.oneOf(list(pp.alphas.lower())) 
integer = pp.pyparsing_common.integer
roman = pp.Word('ivx')

e1 = LPAR + integer + RPAR
e2 = LPAR + letter + RPAR
e3 = LPAR + roman + RPAR

The next part, based on your BNF, might look like:

# predefine levels using Forwards, since they are recursive
lev0 = pp.Forward()
lev1 = pp.Forward()
lev2 = pp.Forward()

lev0 <<= wd | wd + e1 + lev1 + lev0
lev1 <<= wd | wd + e2 + lev2 + lev1
lev2 <<= wd | wd + e3 + lev2

I assume that lev0 is supposed to parse your test string being the 0'th level of nesting.

As I mentioned in the comments to your question, this fails immediately since your test string starts with "(1)" but your levels don't start with any of the e expressions.

Before continuing, your BNF implements repetition in a classic BNF form:

e ::= some expression
list_of_e ::= e (list_of_e | empty)

In pyparsing, you would implement this a little more directly:

wd = pp.Word(pp.alphas)
list_of_wd = pp.OneOrMore(wd)
# or using tuple multiplication short-hand
list_of_wd = wd * (1,)

Looking at your example, I rewrote the levels of your BNF as:

wds = pp.Group(wd*(1,))
lev0 <<= e1 + wds + lev1*(0,)
lev1 <<= e2 + wds + lev2*(0,)
lev2 <<= e3 + wds
expr = lev0()*(1,)
expr.ignore(COMMA | COLON)

(I didn't see the commas or colons helping in the parsing, so I just ignore them.)

Using expr to parse your string:

tests = """\
(1) Y denote (a) color (b) pressure
(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
"""

for test in tests.splitlines():
    print(test)
    expr.parseString(test).pprint()
    print()

We get:

(1) Y denote (a) color (b) pressure
[1, ['Y', 'denote'], 'a', ['color'], 'b', ['pressure']]

(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
[1,
 ['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
 'a',
 ['weight'],
 'b',
 ['height'],
 'c',
 ['depth'],
 2,
 ['Y', 'denote'],
 'a',
 ['color', 'except'],
 'i',
 ['white'],
 'ii',
 ['blue'],
 'b',
 ['pressure']]

So it parsed, in the sense that it got through the whole input string, but all we've done is basic tokenizing, and have not represented any of the structure implied by your integer/alpha/roman nested lists.

Pyparsing includes a grouping class to structure the results:

G = pp.Group
wds = G(wd*(1,))
lev0 <<= G(e1 + G(wds + lev1*(0,)))
lev1 <<= G(e2 + G(wds + lev2*(0,)))
lev2 <<= G(e3 + wds)
expr = lev0()*(1,)
expr.ignore(COMMA | COLON)

This gives output that better preserves your hierarchical structure:

(1) Y denote (a) color (b) pressure
[[1, [['Y', 'denote'], ['a', [['color']]], ['b', [['pressure']]]]]]

(1) Let X denote one of the following: (a) weight (b) height (c) depth (2) Y denote (a) color, except (i) white (ii) blue (b) pressure
[[1,
  [['Let', 'X', 'denote', 'one', 'of', 'the', 'following'],
   ['a', [['weight']]],
   ['b', [['height']]],
   ['c', [['depth']]]]],
 [2,
  [['Y', 'denote'],
   ['a', [['color', 'except'], ['i', ['white']], ['ii', ['blue']]]],
   ['b', [['pressure']]]]]]

A full parser would actually comprehend the concepts of "one of the following" vs. "all of the following", and inclusion and exclusion of elements, but that is beyond the scope of this question.