Search code examples
pythondictionarytreerdftaxonomy

Creating hierarchical tree from nested dictionary in python


I have a large RDF file containing about 600,000 records of Wikidata taxonomy. From this file, I am only interested in subclassOf relations (predicate), thus, I ignore all the other statements keeping only "subclassOf" statements. The statement is like: a is a subclassOf b, b is a subclassOf c Like that c is a parent of b and b is a parent of a. And any parent can have many children. I want to build hierarchical tree using this taxonomy. I have checked this thread and it almost solved my problem. Recursively creating a tree hierarchy without using class/object However, with this, I am getting tree in dictionary which I want to convert into tree data-structure. Following is what I have tried:

data = [['a','x'], ['b','x'], ['c','x'], ['x','y'], ['t','y'], ['y','p'], ['p','q']]

roots = set()
mapping = {}
for child,parent in data:        
    childitem = mapping.get(child,None)
    if childitem is None:
        childitem =  {}
        mapping[child] = childitem
    else:
        roots.discard(child)
    parentitem = mapping.get(parent,None)
    if parentitem is None:
        mapping[parent] = {child:childitem}
        roots.add(parent)
    else:
        parentitem[child] = childitem

for root in roots:
    print(mapping[root])

tree = { id : mapping[id] for id in roots }
print(tree)

Output of tree looks as below:

{'q': {'p': {'y': {'t': {}, 'x': {'c': {}, 'b': {}, 'a': {}}}}}}

I want to convert this dictionary to tree. So e.g. when I say print(mapping['y']), it should give me Node y i.e.

q
├── p
    └── y

currently, if I say mapping['y'], it gives me subtree rooted at y. I think there is some easy solution for this, but I am not able to understand it. I have found this link as well https://gist.github.com/hrldcpr/2012250 to convert dictionary into tree, but not sure how to use it in my case. Alernatively, if anyone knows to directly built a tree from the RDF data that I had given above, then it will be most welcome. Probably python's anytree API will solve my issue.


Solution

  • If you don't mind extra O(N) space, you can keep a parents dictionary, storing the value parent for every key child. And populate it in main for loop.

    Finding all ancestors is now pretty easy. Recursively find all ancestors of your parent and append current node to that result.

    data = [['a','x'], ['b','x'], ['c','x'], ['x','y'], ['t','y'], ['y','p'], ['p','q']]
    parents = {} #to store parents
    roots = set()
    mapping = {}
    for child,parent in data:
        parents[child] = parent #populate parents
        childitem = mapping.get(child,None)
        ................................
    
    def ancestors(node): #the ancestor-finding function
        if not node: return []
        return ancestors(parents.get(node))+[node]
    
    def first_k_ancestor(node,k=5):
        ances = ancestors(node)
        ances.reverse()
        return ances[:k]
    
    
    print(ancestors('a'))
    

    which prints:

    ['q', 'p', 'y']