Search code examples
pythonpython-2.xtext-parsing

Getting string from python within nested parentheses


I have a string like

(0 (1 (2 (3 (4 (5 The) (6 room)) (7 (8 was) (9 (10 very) (11 good)))) (12 but)) (13 (14 (15 the) (16 food)) (17 (18 was) (19 (20 very) (21 bad))))) (22 .))

Which actually is a tree :

enter image description here

I want to get achieve having string for a given node, i.e. if a say node 0 I should recieve "The room was very good but food was bad." if I say node 2 I should receive "The room was very good but" and for node 5 "The" and so on.


Solution

  • I would first build the obvious tree (where nodes have children nodes and possibly a string payload) then process it to get the alternative you want (with a string including children's payloads). E.g, a rough draft (no error-checking &c):

    class Node(object):
        def __init__(self, n):
            self.n = n
            self.children = []
            self.text = []
            self.payload = self.wholestring = ''
    
    def make_payload_tree(astring):
        root = Node(-1)
        parents = [root]
        sit = iter(astring)
        for c in sit:
            if c=='(':
                mkn = []
                for c in sit:
                    if c==' ': break
                    mkn.append(c)
                newnode = Node(int(''.join(mkn)))
                parents[-1].children.append(newnode)
                parents.append(newnode)
            elif c==')':
                oldnode = parents.pop()
                oldnode.payload = ''.join(oldnode.text)
            else:
                parents[-1].text.append(c)
      return root
    

    You can roughly verify that this payload-tree is correct e.g with:

    def print_tree(r, ind=0):
        print ' '*ind, r.n, r.payload, r.wholestring
        for c in r.children:
            print_tree(c, ind + 2)
    

    Of course, at this point, the wholestring will still be empty strings.

    Now, a second pass lets you build the wholestring attributes:

    def makewhole(node):
        for c in node.children:
            makewhole(c)
        s = node.payload + ' '.join(c.wholestring for c in node.children)
        node.wholestring = s.replace('  ', ' ')
    

    and the print_tree should verify you have the wholestrings you want.

    Now the interesting part is to put in place proper error diagnosis (this code is quite frail if there is any "syntax error" in the input string, and the syntax thereof is implied by your example, never made explicit), but that's probably best done with a proper lexer and parser approach rather than with ad-hoc parsing like I'm doing here.