How to parse a non-unique positional pattern?


I have two problems with parsing a rather nasty pattern. Here are some nonsense examples:

examples = [
    "",
    "red green",
    "#1# red green",
    "#1# red green <2>",
    "#1,2# red green <2,3>",
    "red green ()",
    "#1# red green (blue)",
    "#1# red green (#5# blue) <2>",
    "#1# red green (#5# blue <6>) <2>",
    "#1,2# red green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>",
    "#1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>",
]

I should say at this point that I have no control over the creation of these strings.

As you can see, basically every pattern that I would like to parse is optional. Then there are distinct parts that I would like to capture. I look at the structure of these examples as:

[cars] [colors] [comments] [buyers]

where comments contains a sub-structure, possibly several of them separated by semicolons:

comments: ([cars] [colors] [buyers]; ...)

I have created the following grammars in order to capture the content:

import pyparsing as pp

integer = pp.pyparsing_common.integer

car_ref = "#" + pp.Group(pp.delimitedList(integer))("cars") + "#"

buyer_ref = "<" + pp.Group(pp.delimitedList(integer))("buyers") + ">"

My questions are then:

  1. Is there a smart way (maybe through positioning) to distinguish something in parentheses that is part of the colors and not comments?
  2. I have worked a bit on the problem of nested parentheses within comments. My strategy was that I would take the inner string, use ; as a delimiter and break it up. However, I failed to execute that strategy. What I tried is:
sub_comment = (
    pp.Optional(car_ref) +
    pp.Group(pp.ZeroOrMore(pp.Regex(r"[^;#<>\s]")))("colors") +
    pp.Optional(buyer_ref)
)

split_comments = pp.Optional(pp.delimitedList(
    pp.Group(sub_comment)("comments*"),
    delim=";"
))


def parse_comments(original, location, tokens):
    # Strip the parentheses.
    return split_comments.transformString(original[tokens[0] + 1:tokens[2] - 1])


comments = pp.originalTextFor(pp.nestedExpr()).setParseAction(parse_comments)

When I use this everything ends up as one continuous string, presumably because of the outer pp.originalTextFor.

res = comments.parseString("(#5# blue (purple) <6>;#7# yellow <10>)", parseAll=True)

EDIT:

Taking the last example string, I'd like to end up with an object structure that looks like:

{
  "cars": [1, 2],
  "colors": "red (maroon) green",
  "buyers": [2, 3],
  "comments": [
    {
      "cars": [5],
      "colors": "blue (purple)",
      "buyers": [6]
    },
    {
      "cars": [7],
      "colors": "yellow",
      "buyers": [10]
    }
  ]
}

So parentheses within the colors section should be kept in place, just as in prose. For parentheses that introduce a comments section, I don't care about their position, nor about the order of the individual comments.


Solution

  • I think you had most of the pieces in place. You were just struggling with the recursive part, where a comment can itself hold sub-structures, including more comments.

    You had this as your BNF:

    structure ::= [cars] [colors] [comments] [buyers]
    cars ::= '#' integer, ... '#'
    buyers ::= '<' integer, ... '>'
    

    I filled in the blanks with these guesses, based on your given examples:

    color ::= word composed of alphas
    colors ::= (color | '(' color ')' )...
    
    comments ::= '(' structure ';' ... ')'
    

    I took your definitions for cars and buyers, and added colors and the recursive definition for comments. Then I did a pretty rote conversion from BNF to pyparsing expressions:

    import pyparsing as pp
    
    integer = pp.pyparsing_common.integer
    
    car_ref = "#" + pp.Group(pp.delimitedList(integer))("cars") + "#"
    buyer_ref = "<" + pp.Group(pp.delimitedList(integer))("buyers") + ">"
    
    # not sure if this will be sufficient for color, but it works for the given examples
    color = pp.Word(pp.alphas)
    colors = pp.originalTextFor(pp.OneOrMore(color | '(' + color + ')'))("colors")
    
    # define comment placeholder so it can be used in definition of structure
    comment = pp.Forward()
    
    structure = pp.Group(pp.Optional(car_ref)
                         + pp.Optional(colors)
                         + pp.Optional(comment)("comments")
                         + pp.Optional(buyer_ref))
    
    # now insert the definition of a comment as a delimited list of structures; this takes care of
    # any nesting of comments within comments
    LPAREN, RPAREN = map(pp.Suppress, "()")
    comment <<= pp.Group(LPAREN + pp.Optional(pp.delimitedList(structure, delim=';')) + RPAREN)
    

    The tricky part is to define the contents of comment as a delimited list of structures, and to use the <<= operator to insert that definition into the previously defined Forward() placeholder.
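
    To see the Forward() / <<= mechanism on its own, here is a minimal sketch on a toy grammar that is not part of your problem: nested bracketed lists of integers. The names nested, item, LBRACK and RBRACK are made up for this illustration.

```python
import pyparsing as pp

integer = pp.pyparsing_common.integer
LBRACK, RBRACK = map(pp.Suppress, "[]")

# Placeholder: lets `item` refer to `nested` before `nested` has a body.
nested = pp.Forward()
item = integer | nested

# Fill in the placeholder. `<<=` updates the existing Forward object in place,
# so the reference already stored inside `item` sees the definition.
nested <<= pp.Group(LBRACK + pp.Optional(pp.delimitedList(item)) + RBRACK)

result = nested.parseString("[1, [2, 3], []]", parseAll=True).asList()
print(result)
```

    The comment expression is built the same way: structure refers to the placeholder, and the placeholder is filled in afterwards with a list of structures.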

    Passing your examples to structure.runTests() gives the output below. By default, runTests treats test lines that start with '#' as comments and skips them. That has to be disabled with comment=None for your examples, since a leading '#' is a valid intro for cars:

    structure.runTests(examples, comment=None)
    
    red green
    [['red green']]
    [0]:
      ['red green']
      - colors: 'red green'
    
    #1# red green
    [['#', [1], '#', 'red green']]
    [0]:
      ['#', [1], '#', 'red green']
      - cars: [1]
      - colors: 'red green'
    
    #1# red green <2>
    [['#', [1], '#', 'red green', '<', [2], '>']]
    [0]:
      ['#', [1], '#', 'red green', '<', [2], '>']
      - buyers: [2]
      - cars: [1]
      - colors: 'red green'
    
    #1,2# red green <2,3>
    [['#', [1, 2], '#', 'red green', '<', [2, 3], '>']]
    [0]:
      ['#', [1, 2], '#', 'red green', '<', [2, 3], '>']
      - buyers: [2, 3]
      - cars: [1, 2]
      - colors: 'red green'
    
    red green ()
    [['red green', [[]]]]
    [0]:
      ['red green', [[]]]
      - colors: 'red green'
      - comments: [[]]
        [0]:
          []
    
    #1# red green (blue)
    [['#', [1], '#', 'red green (blue)']]
    [0]:
      ['#', [1], '#', 'red green (blue)']
      - cars: [1]
      - colors: 'red green (blue)'
    
    #1# red green (#5# blue) <2>
    [['#', [1], '#', 'red green', [['#', [5], '#', 'blue']], '<', [2], '>']]
    [0]:
      ['#', [1], '#', 'red green', [['#', [5], '#', 'blue']], '<', [2], '>']
      - buyers: [2]
      - cars: [1]
      - colors: 'red green'
      - comments: [['#', [5], '#', 'blue']]
        [0]:
          ['#', [5], '#', 'blue']
          - cars: [5]
          - colors: 'blue'
    
    #1# red green (#5# blue <6>) <2>
    [['#', [1], '#', 'red green', [['#', [5], '#', 'blue', '<', [6], '>']], '<', [2], '>']]
    [0]:
      ['#', [1], '#', 'red green', [['#', [5], '#', 'blue', '<', [6], '>']], '<', [2], '>']
      - buyers: [2]
      - cars: [1]
      - colors: 'red green'
      - comments: [['#', [5], '#', 'blue', '<', [6], '>']]
        [0]:
          ['#', [5], '#', 'blue', '<', [6], '>']
          - buyers: [6]
          - cars: [5]
          - colors: 'blue'
    
    #1,2# red green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
    [['#', [1, 2], '#', 'red green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']]
    [0]:
      ['#', [1, 2], '#', 'red green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']
      - buyers: [2, 3]
      - cars: [1, 2]
      - colors: 'red green'
      - comments: [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']]
        [0]:
          ['#', [5], '#', 'blue (purple)', '<', [6], '>']
          - buyers: [6]
          - cars: [5]
          - colors: 'blue (purple)'
        [1]:
          ['#', [7], '#', 'yellow', '<', [10], '>']
          - buyers: [10]
          - cars: [7]
          - colors: 'yellow'
    
    #1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
    [['#', [1, 2], '#', 'red (maroon) green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']]
    [0]:
      ['#', [1, 2], '#', 'red (maroon) green', [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']], '<', [2, 3], '>']
      - buyers: [2, 3]
      - cars: [1, 2]
      - colors: 'red (maroon) green'
      - comments: [['#', [5], '#', 'blue (purple)', '<', [6], '>'], ['#', [7], '#', 'yellow', '<', [10], '>']]
        [0]:
          ['#', [5], '#', 'blue (purple)', '<', [6], '>']
          - buyers: [6]
          - cars: [5]
          - colors: 'blue (purple)'
        [1]:
          ['#', [7], '#', 'yellow', '<', [10], '>']
          - buyers: [10]
          - cars: [7]
          - colors: 'yellow'
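
    The effect of comment=None can be checked in isolation. This is a small sketch that uses only your car_ref expression. The variable names tests, skipped and parsed are made up for the illustration, and printResults=False just silences the printed report so that the returned list can be inspected instead.

```python
import pyparsing as pp

integer = pp.pyparsing_common.integer
car_ref = "#" + pp.Group(pp.delimitedList(integer))("cars") + "#"

tests = ["#1,2#", "#7#"]

# Default behavior: every test starts with '#', so each one is taken
# for a comment and never reaches the parser.
_, skipped = car_ref.runTests(tests, printResults=False)

# With comment=None the same strings are actually parsed.
_, parsed = car_ref.runTests(tests, comment=None, printResults=False)

print(len(skipped), len(parsed))
for text, tokens in parsed:
    print(text, tokens["cars"].asList())
```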
    

    If you convert all the parsed results to regular Python dicts using asDict() you get:

    structure.runTests(examples, comment=None,
                       postParse=lambda test, results: results[0].asDict()
                       )
    
    red green
    {'colors': 'red green'}
    
    #1# red green
    {'cars': [1], 'colors': 'red green'}
    
    #1# red green <2>
    {'colors': 'red green', 'cars': [1], 'buyers': [2]}
    
    #1,2# red green <2,3>
    {'colors': 'red green', 'cars': [1, 2], 'buyers': [2, 3]}
    
    red green ()
    {'comments': [[]], 'colors': 'red green'}
    
    #1# red green (blue)
    {'cars': [1], 'colors': 'red green (blue)'}
    
    #1# red green (#5# blue) <2>
    {'colors': 'red green', 'cars': [1], 'comments': [{'cars': [5], 'colors': 'blue'}], 'buyers': [2]}
    
    #1# red green (#5# blue <6>) <2>
    {'colors': 'red green', 'cars': [1], 'comments': [{'colors': 'blue', 'cars': [5], 'buyers': [6]}], 'buyers': [2]}
    
    #1,2# red green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
    {'colors': 'red green', 'cars': [1, 2], 'comments': [{'colors': 'blue (purple)', 'cars': [5], 'buyers': [6]}, {'colors': 'yellow', 'cars': [7], 'buyers': [10]}], 'buyers': [2, 3]}
    
    #1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>
    {'colors': 'red (maroon) green', 'cars': [1, 2], 'comments': [{'colors': 'blue (purple)', 'cars': [5], 'buyers': [6]}, {'colors': 'yellow', 'cars': [7], 'buyers': [10]}], 'buyers': [2, 3]}
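
    Outside of runTests, the same conversion can be applied to a single string. The sketch below repeats the grammar from above so that it runs on its own. parse_record is a hypothetical helper name, not something pyparsing provides.

```python
import pyparsing as pp

integer = pp.pyparsing_common.integer
car_ref = "#" + pp.Group(pp.delimitedList(integer))("cars") + "#"
buyer_ref = "<" + pp.Group(pp.delimitedList(integer))("buyers") + ">"
color = pp.Word(pp.alphas)
colors = pp.originalTextFor(pp.OneOrMore(color | "(" + color + ")"))("colors")

comment = pp.Forward()
structure = pp.Group(
    pp.Optional(car_ref)
    + pp.Optional(colors)
    + pp.Optional(comment)("comments")
    + pp.Optional(buyer_ref)
)
LPAREN, RPAREN = map(pp.Suppress, "()")
comment <<= pp.Group(LPAREN + pp.Optional(pp.delimitedList(structure, delim=";")) + RPAREN)


def parse_record(text):
    """Hypothetical helper: parse one string and return a plain dict."""
    # structure is a Group, so the parsed record is the first (only) token.
    return structure.parseString(text, parseAll=True)[0].asDict()


record = parse_record(
    "#1,2# red (maroon) green (#5# blue (purple) <6>;#7# yellow <10>) <2,3>"
)
print(record["colors"])
for sub in record["comments"]:
    print(sub)
```

    Keys for parts that are absent from the input (for example buyers in "#1# red green") are simply missing from the dict, so record.get("buyers", []) is a safer way to read them than indexing.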