Search code examples
pythonrecursiontreeparent-child

How to resolve common children in tree and give them unique names in Python?


I have the following tree represented in parent/child relationships:

import pandas as pd
df = pd.DataFrame(columns=['Parent','Child'])
df['Parent']=["A","A","A","B","B","B","C","C","F","G","G"]
df['Child']=["B","C","E","D","E","F","F","G","H","H","I"]

Some of the nodes have multiple parents. This should be removed by giving these common children different IDs based on the path. This is how it should look like after (right tree):

enter image description here

I wrote a funciton tha is supposed to write a path for every node and add it to the name, the results are collected in the dictionary "res". After a few day of trying this seems to be a bad approach as it doeas not split the paths. Below an example for node H.

Any ideas how the tree can be transformed?

res = {}
def find_parent(child, path):

    path.append(str(child))
    print("Path:    ", path)

    parents = df.loc[df['Child'] == child, ['Parent']]['Parent'].tolist()
    print("Parents: ",parents)

    if not parents:
        print("Path end reached!")
        print("Result: ", res)
        # i+1

    else :
        for i in range(0,len(parents)-1):
            if len(parents)>1: #dann neue paths
                path = [(str(child))]

                new_path = 'path_{}'.format(i)
                print("-->add ",parents[i])
                res[new_path] = str(''.join(path)) + parents[i]

                print("Result: ", res)
                print()
                find_parent(parents[i], path)

            else: 
                new_path = 'path_{}'.format(i)
                print("-->add ",parents[i])
                res[new_path] = str(''.join(path)) + parents[i]

                print("Result: ", res)
                print()

                find_parent(parents[0],path)

    return res

Example result for node "H"

find_parent("H", [])

{'path_0': 'FB'}

It should give H_FBA, HFCA and H_GCA.


Solution

  • You can do this with a variety of typical graph transformations. Before diving into the code, we're working with a directed acyclic graph with one source node (that is, node with no incoming edges) in the original dataframe. These properties simplify matters quite a bit and guarantee that we can create a unique tree as shown in your desired output.

    The flat dataframe isn't terribly easy to manipulate as a graph, so the first step I took was to turn it into an adjacency list.

    Next, there are a few preparatory steps to perform:

    • Find the source node (although I used a generic graph function that handles more than one source, we're guaranteed there'll only be one source so we'll take the only item from the returned set). The lone source node will become the root of the new tree.
    • Run a function to extract all of the paths in the DAG starting from the source.
    • Create a dict that maps node names to incoming edge counts which helps guide the renaming procedure.
    • Create a renaming dictionary that lets us keep track of which nodes have been renamed.

    Having built these structures, iterate over all possible paths in the graph to build the relabeled graph, formatting nodes with multiple incoming edges according to your spec (a reversed path node list) and storing the new names in the renaming dict.

    Finally, sort a flattened version of this relabeled graph and dump it in the result dataframe.

    Here's the code:

    import pandas as pd
    from collections import defaultdict
    
    def find_source_nodes(graph):
        sources = set(graph)
    
        for neighbors in graph.values():
            sources = sources - neighbors
    
        return sources
    
    def count_incoming_edges(graph):
        counts = defaultdict(int)
    
        for neighbors in graph.values():
            for neighbor in neighbors:
                counts[neighbor] += 1
    
        return counts
    
    def find_all_paths_in_dag(graph, src, path=[]):
        path.append(src)
    
        if src in graph and graph[src]:
            for neighbor in graph[src]:
                yield from find_all_paths_in_dag(graph, neighbor, path)
        else:
            yield path[:]
    
        path.pop()
    
    def flatten_adjacency_list(adj_list):
        flat_graph = []
    
        for parent, children in adj_list.items():
            flat_graph.extend([(parent, child) for child in children])
    
        return flat_graph
    
    def relabel_dag(graph, root):
        relabeled_graph = defaultdict(set)
        all_paths = find_all_paths_in_dag(graph, root)
        incoming_edge_counts = count_incoming_edges(graph)
        renamed = {k: k for k in graph}
    
        for path in all_paths:
            for src, dst, i in zip(path, path[1:], range(len(path) - 1)):
                if incoming_edge_counts[dst] > 1:
                    renamed[dst] = dst = f"{dst}_{''.join(path[:i+1][::-1])}"
    
                relabeled_graph[renamed[src]].add(dst)
    
        return relabeled_graph
    
    if __name__ == "__main__":
        df = pd.DataFrame(columns=["Parent", "Child"])
        df["Parent"] = ["A", "A", "A", "B", "B", "B", "C", "C", "F", "G", "G"]
        df["Child"] = ["B", "C", "E", "D", "E", "F", "F", "G", "H", "H", "I"]
        graph = defaultdict(set)
    
        for parent, child in zip(df["Parent"], df["Child"]):
            graph[parent].add(child)
    
        root = next(iter(find_source_nodes(graph)))
        relabeled_tree = relabel_dag(graph, root)
        flat_relabeled_tree = sorted(flatten_adjacency_list(relabeled_tree))
        relabeled_df = pd.DataFrame(flat_relabeled_tree, columns=["Parent", "Child"])
        print(relabeled_df)
    

    Output:

       Parent  Child
    0       A      B
    1       A      C
    2       A    E_A
    3       B      D
    4       B   E_BA
    5       B   F_BA
    6       C   F_CA
    7       C      G
    8    F_BA  H_FBA
    9    F_CA  H_FCA
    10      G  H_GCA
    11      G      I