
How can I create a nested dictionary object from a tree-like, tab-indented text file?


I have a tree structure in a text file, with one node per line and tabs marking the nesting depth, like this:

a
\t1
\t2
\t3
\t\tb
\t\tc
\t4
\t5

And I am looking to turn this into:

{
  'name': 'a',
  'children': [
    {'name': '1'},
    {'name': '2'},
    {
      'name': '3',
      'children': [
        {'name': 'b'},
        {'name': 'c'}
      ]
    },
    {'name': '4'},
    {'name': '5'}
  ]
}

This is meant as the data input for a d3.js collapsible tree. I assume I have to use recursion in some way, but I cannot figure out how.

I have tried turning the input into a list like this:

[('a',0), ('1',1), ('2',1), ('3',1), ('b',2), ('c',2), ('4',1), ('5',1)]

using this code:

def parser():
    # Run from the project root `retail-tree`: `python3 src/main.py`
    all_line_details = []
    with open('assets/retail') as f:
        for line in f:
            line = line.rstrip('\n ')
            # Split on tab characters: each leading tab yields one empty piece,
            # so the number of pieces minus one is the depth.
            splitline = line.split('\t')
            tup = (splitline[-1], len(splitline) - 1)
            all_line_details.append(tup)
            print(tup)
    return all_line_details

Here, the first element of each tuple is the node name and the second is the number of tabs on that line. I am unsure how to write the recursion step from here. I would appreciate any help!


Solution

  • You can use a function that calls re.findall with a regex matching one line as the name of the node, followed by zero or more lines that start with a tab, captured together as the children. The function then strips the first tab from each line of the children string and calls itself on the result to build the same structure one level down. Note that the pattern expects every line, including the last, to end with a newline:

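    import re

    def parser(s):
        output = []
        # Group 1 is a node's own line; group 2 is the run of tab-indented lines below it.
        for name, children in re.findall(r'(.*)\n((?:\t.*\n)*)', s):
            node = {'name': name}
            if children:
                # Strip one leading tab from each child line, then recurse.
                dedented = ''.join(line[1:] for line in children.splitlines(True))
                node['children'] = parser(dedented)
            output.append(node)
        return output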
    

    so that given:

    s = '''a
    \t1
    \t2
    \t3
    \t\tb
    \t\tc
    \t4
    \t5
    '''
    

    parser(s)[0] returns:

    {'name': 'a',
     'children': [{'name': '1'},
                  {'name': '2'},
                  {'name': '3', 'children': [{'name': 'b'}, {'name': 'c'}]},
                  {'name': '4'},
                  {'name': '5'}]}
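
Since the question already produces a list of (name, depth) tuples, a non-recursive alternative is to keep a stack holding the most recent node at each depth. The following is only a sketch of that idea, not part of the answer above. `pairs_from_text` and `build_tree` are hypothetical helper names, and the code assumes that depth is encoded purely by leading tab characters.

```python
def pairs_from_text(text):
    """Turn tab-indented text into (name, depth) tuples, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name = line.lstrip('\t')
        # The number of stripped leading tabs is the depth.
        pairs.append((name.rstrip(), len(line) - len(name)))
    return pairs


def build_tree(pairs):
    """Build nested {'name', 'children'} dicts from (name, depth) tuples."""
    roots = []  # nodes that have no parent
    stack = []  # stack[d] is the most recent node seen at depth d
    for name, depth in pairs:
        node = {'name': name}
        # Discard nodes at this depth or deeper; what remains ends in the parent.
        del stack[depth:]
        if stack:
            stack[-1].setdefault('children', []).append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


text = 'a\n\t1\n\t2\n\t3\n\t\tb\n\t\tc\n\t4\n\t5\n'
pairs = pairs_from_text(text)
tree = build_tree(pairs)[0]
```

Here `pairs` is the same list of tuples shown in the question, and `tree` has the same shape as the result of parser(s)[0]. Unlike the regex version, this sketch does not need a trailing newline on the last line, and a line indented more than one level deeper than the previous one is simply attached to the nearest node above it.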