Search code examples
pythonparsingtreetext-parsingparse-tree

How to parse a tree from a string in Python?


I have formatted all my university notes as follows:

CourseName: {
    Part 1: {
        I.I - Intro: {
            Topic1: {
                descr1;
                descr2: {
                    2.a;
                    2.b;
                    2.c.
                };
                descr3.
            };
            Topic2: {
                descr: {
                    example.
                }.
            }.
        };
        I.II - NextChapter: {
            Topic3: {
                whatever.
            }.
        }.
    };
    Part 2: {
        II.I - FinalChapter: {
            content.
        }.
    }.
}

I'd like to structure them into a Tree data structure and I've tried doing so, both recursively and iteratively, in the past hours, doing many researches online, but none of my attempts at it is working.

I've already implemented a Node class (with self.__value and a list self.__children and all the useful methods you would expect from it) as well as a Tree class (with self.__nodes as a dictionary and other utility methods), so feel free to use methods such as add_node or add_child in any form of your liking in your answers.

What I'm struggling with is to understand how to structure the function def parseTree(s, l) - that ideally takes as inputs a string s (my notes) and a list l establishing the delimiters i.e. [":{", ";", "}."] or ["{","}"] or similar - and returns a tree object, with each node having as value the text preceding :{ and a list of children (if any) separated by ; in the text.

Any suggestion?


Solution

  • Assuming your data is stored in a file, you can build a simple class to parse the structure into a dictionary. You can recursively traverse the data by creating a new Notes object for each key found:

    file_data = filter(None, [i.strip('\n') for i in open('filename.txt')])
    import re
    class Notes:
       def __init__(self, token_data):
         self.token_data = token_data
         self.current_dict = {}
         self.current_vals = []
         self.parse()
       def parse(self):
         while True:
           start = next(self.token_data, None)
           if not start or "}" in start:
             break
           if start.endswith('{'):
              note = Notes(self.token_data)
              final_result = filter(lambda x:x, note.current_vals + [note.current_dict]) if note.current_vals else note.current_dict
              self.current_dict[re.findall('[\w\s\-\.]+', re.sub('^\s+', '', start))[0]] = final_result[0] if isinstance(final_result, list) and len(final_result) == 1 else final_result
              self.token_data = note.token_data
           else:
              self.current_vals.append(re.sub('^\s+', '', start))
    
    
    course_notes = Notes(iter(file_data)).current_dict
    

    Output:

    {'CourseName': 
        {'Part 1': 
          {'I.I - Intro': 
             {'Topic1': ['descr1;',
                        'descr3.',
                 {'descr2': ['2.a;',
                             '2.b;',
                              '2.c.']
                    }
                   ],
             'Topic2': {'descr': 'example.'}
                     },
                      'I.II - NextChapter': 
                   {'Topic3': 'whatever.'}
                 },
           'Part 2':{'II.I - FinalChapter': 'content.'}
         } 
       }