
Parsing nested lists and returning original strings for every valid list


Suppose I have a string s = '{aaaa{bc}xx{d{e}}f}', which has a structure of nested lists. I would like to have a hierarchical representation of it, while still being able to access the sub-strings corresponding to the valid sub-lists. For simplicity, let's forget about the hierarchy: I just want a list of the sub-strings corresponding to valid sub-lists, something like:

['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']

Using nestedExpr, one can obtain the nested structure, which includes all valid sub-lists:

import pyparsing as pp

s = '{aaaa{bc}xx{d{e}}f}'
not_braces = pp.CharsNotIn('{}')
expr = pp.nestedExpr('{', '}', content=not_braces)
res = expr('L0 Contents').parseString(s)
print(res.dump())

prints:

[['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
- L0 Contents: [['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
  [0]:
    ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
    [0]:
      aaaa
    [1]:
      ['bc']
    [2]:
      xx
    [3]:
      ['d', ['e']]
      [0]:
        d
      [1]:
        ['e']
    [4]:
      f

In order to obtain the original string representation for a parsed element, I have to wrap it into pyparsing.originalTextFor(). However, this will remove all sub-lists from the result:

s = '{aaaa{bc}xx{d{e}}f}'
not_braces = pp.CharsNotIn('{}')
expr = pp.nestedExpr('{', '}', content=not_braces)
res = pp.originalTextFor(expr)('L0 Contents').parseString(s)
print(res.dump())

prints:

['{aaaa{bc}xx{d{e}}f}']
- L0 Contents: '{aaaa{bc}xx{d{e}}f}'

In effect, the originalTextFor() wrapper flattened out everything that was inside it.

The question: is there an alternative to originalTextFor() that keeps the structure of its child parse elements? (It would be nice to have a non-discarding analogue that could be used to create named tokens for parsed sub-expressions.)

Note that scanString() will only give me the level 0 sub-lists and will not look inside them. I guess I could use setParseAction(), but the internal operation of ParserElement objects is not documented, and I haven't had a chance to dig into the source code yet. Thanks!

Update 1. Somewhat related: https://stackoverflow.com/a/39885391/11932910 https://stackoverflow.com/a/17411455/11932910


Solution

  • Instead of using originalTextFor, wrap your nestedExpr expression in locatedExpr:

    import pyparsing as pp
    parser = pp.locatedExpr(pp.nestedExpr('{','}'))
    

    locatedExpr will return a 3-element ParseResults:

    • start location
    • parsed value
    • end location
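
As a quick check of that three-part shape, here is a minimal sketch that parses the sample string from the question and reads the parts back through the results names that locatedExpr assigns (locn_start, value and locn_end). The variable names located and group are illustrative only.

```python
import pyparsing as pp

s = '{aaaa{bc}xx{d{e}}f}'

# Wrap nestedExpr so the parser also records where the match starts and ends.
located = pp.locatedExpr(pp.nestedExpr('{', '}'))

# The result holds a single group: [start, parsed value, end].
group = located.parseString(s)[0]
start = group['locn_start']
end = group['locn_end']

print(start, end)      # character offsets of the match within s
print(s[start:end])    # the original text covered by the match
```

Slicing the input with the two offsets is what gives back the original text, which is the idea the parse action below builds on.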

    You can then attach a parse action to this parser to modify the parsed tokens in place, and add your own original_string named result, containing the original text as sliced from the input string:

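    def extract_original_text(st, loc, tokens):
        # tokens[0] is the [start, value, end] group built by locatedExpr;
        # unpack it, replacing the tokens in place with the parsed value
        start, tokens[:], end = tokens[0]
        # slice the original text out of the input string
        tokens['original_string'] = st[start:end]
    parser.addParseAction(extract_original_text)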
    

    Now use this parser to parse and dump the results:

    result = parser.parseString(s)
    print(result.dump())
    

    Prints:

    ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
    - original_string: '{aaaa{bc}xx{d{e}}f}'
    

    And access the original_string result using:

    print(result.original_string)
    

    EDIT - how to attach original_string to each nested substructure

    Keeping the original strings on the sub-structures requires more work than nestedExpr alone can do. You pretty much have to implement your own recursive parser.

    To implement your own version of nestedExpr, start with something like this:

    LBRACE, RBRACE = map(pp.Suppress, "{}")
    expr = pp.Forward()
    
    term = pp.Word(pp.alphas)
    expr_group = pp.Group(LBRACE + expr + RBRACE)
    expr_content = term | expr_group
    
    expr <<= expr_content[...]
    
    sample = '{aaaa{bc}xx{d{e}}f}'
    print(sample)
    print(expr.parseString(sample).dump())
    

    This will dump out the parsed results, without the 'original_string' names:

    {aaaa{bc}xx{d{e}}f}
    [['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
    [0]:
      ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
      [0]:
        aaaa
      [1]:
        ['bc']
      [2]:
        xx
      [3]:
        ['d', ['e']]
        [0]:
          d
        [1]:
          ['e']
      [4]:
        f
    

    To add the 'original_string' names, we first change the Group to the locatedExpr wrapper.

    expr_group = pp.locatedExpr(LBRACE + expr + RBRACE)
    

    This will add the start and end locations to each nested subgroup (these are not accessible to you when using nestedExpr):

    {aaaa{bc}xx{d{e}}f}
    [[0, 'aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f', 19]]
    [0]:
      [0, 'aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f', 19]
      - locn_end: 19
      - locn_start: 0
      - value: ['aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f']
        [0]:
          aaaa
        [1]:
          [5, 'bc', 9]
          - locn_end: 9
          - locn_start: 5
          - value: ['bc']
    ...
    

    The parse action is now also more complicated, because it has to remove the names and list items that locatedExpr inserted:

    def extract_original_text(st, loc, tokens):
        # pop/delete names and list items inserted by locatedExpr
        # (save start and end locations to local vars)
        tt = tokens[0]
        start = tt.pop("locn_start")
        end = tt.pop("locn_end")
        tt.pop("value")
        del tt[0]
        del tt[-1]
    
        # add 'original_string' results name
        orig_string = st[start:end]
        tt['original_string'] = orig_string
    
    expr_group.addParseAction(extract_original_text)
    

    With this change, you will now get this structure:

    {aaaa{bc}xx{d{e}}f}
    [['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
    [0]:
      ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
      - original_string: '{aaaa{bc}xx{d{e}}f}'
      [0]:
        aaaa
      [1]:
        ['bc']
        - original_string: '{bc}'
      [2]:
        xx
      [3]:
        ['d', ['e']]
        - original_string: '{d{e}}'
        [0]:
          d
        [1]:
          ['e']
          - original_string: '{e}'
      [4]:
        f
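
To get back the flat list of sub-strings asked for in the question, one option is to walk the nested results and collect every original_string. The following is a sketch rather than a definitive implementation: it rebuilds the parser from the snippets above so that it runs on its own, and collect_original_strings is a hypothetical helper name, not part of pyparsing.

```python
import pyparsing as pp

sample = '{aaaa{bc}xx{d{e}}f}'

# Recursive grammar from the answer, with locatedExpr in place of Group.
LBRACE, RBRACE = map(pp.Suppress, "{}")
expr = pp.Forward()
term = pp.Word(pp.alphas)
expr_group = pp.locatedExpr(LBRACE + expr + RBRACE)
expr <<= (term | expr_group)[...]

def extract_original_text(st, loc, tokens):
    # Strip what locatedExpr added, then attach the sliced source text.
    tt = tokens[0]
    start = tt.pop("locn_start")
    end = tt.pop("locn_end")
    tt.pop("value")
    del tt[0]
    del tt[-1]
    tt['original_string'] = st[start:end]

expr_group.addParseAction(extract_original_text)

def collect_original_strings(results):
    """Hypothetical helper: depth-first walk gathering original_string names."""
    found = []
    if 'original_string' in results:
        found.append(results['original_string'])
    for item in results:
        # Only nested groups are ParseResults; plain terms are strings.
        if isinstance(item, pp.ParseResults):
            found.extend(collect_original_strings(item))
    return found

substrings = collect_original_strings(expr.parseString(sample))
print(substrings)
```

Because the walk visits each group before its children, enclosing sub-lists are listed ahead of the sub-lists they contain.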
    

    Note: the current version of ParseResults.dump() has a limitation: it shows either keys or subitems, but not both. The dump() listing above, which shows both, requires a fix that removes that limitation, to be released in the next pyparsing version. Even if dump() does not show these substructures, they are present in your actual structure, as you can see by printing the repr of the results:

    result = expr.parseString(sample)
    print(repr(result[0]))
    
    (['aaaa', (['bc'], {'original_string': '{bc}'}), 'xx', (['d', (['e'], {'original_string': '{e}'})], {'original_string': '{d{e}}'}), 'f'], {'original_string': '{aaaa{bc}xx{d{e}}f}'})
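
For the simplified goal stated in the question (only the flat list of sub-strings, no hierarchy), pyparsing is not strictly required. As a point of comparison, here is a minimal standard-library sketch that uses a stack of opening-brace positions instead of a grammar. This is a different technique from the locatedExpr approach above, and balanced_substrings is a hypothetical name chosen for illustration.

```python
def balanced_substrings(text, opener="{", closer="}"):
    """Return every balanced opener...closer substring, enclosing ones first."""
    open_positions = []   # stack of indices of openers not yet closed
    spans = []            # (start, end) pairs, end exclusive
    for index, char in enumerate(text):
        if char == opener:
            open_positions.append(index)
        elif char == closer:
            if not open_positions:
                raise ValueError(f"unmatched {closer!r} at index {index}")
            # The most recent unmatched opener pairs with this closer.
            spans.append((open_positions.pop(), index + 1))
    if open_positions:
        raise ValueError(f"unmatched {opener!r} at index {open_positions[-1]}")
    # Sorting by start position puts each group ahead of the groups inside it.
    spans.sort()
    return [text[start:end] for start, end in spans]

print(balanced_substrings('{aaaa{bc}xx{d{e}}f}'))
# prints ['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']
```

This does no tokenizing of the content between braces, so it only fits when the sub-strings themselves are all that is needed.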