I have the following tree represented in parent/child relationships:
import pandas as pd
df = pd.DataFrame(columns=['Parent','Child'])
df['Parent']=["A","A","A","B","B","B","C","C","F","G","G"]
df['Child']=["B","C","E","D","E","F","F","G","H","H","I"]
Some of the nodes have multiple parents. This should be removed by giving these common children different IDs based on the path. This is how it should look like after (right tree):
I wrote a funciton tha is supposed to write a path for every node and add it to the name, the results are collected in the dictionary "res". After a few day of trying this seems to be a bad approach as it doeas not split the paths. Below an example for node H.
Any ideas how the tree can be transformed?
res = {}
def find_parent(child, path):
path.append(str(child))
print("Path: ", path)
parents = df.loc[df['Child'] == child, ['Parent']]['Parent'].tolist()
print("Parents: ",parents)
if not parents:
print("Path end reached!")
print("Result: ", res)
# i+1
else :
for i in range(0,len(parents)-1):
if len(parents)>1: #dann neue paths
path = [(str(child))]
new_path = 'path_{}'.format(i)
print("-->add ",parents[i])
res[new_path] = str(''.join(path)) + parents[i]
print("Result: ", res)
print()
find_parent(parents[i], path)
else:
new_path = 'path_{}'.format(i)
print("-->add ",parents[i])
res[new_path] = str(''.join(path)) + parents[i]
print("Result: ", res)
print()
find_parent(parents[0],path)
return res
Example result for node "H"
find_parent("H", [])
{'path_0': 'FB'}
It should give H_FBA, HFCA and H_GCA.
You can do this with a variety of typical graph transformations. Before diving into the code, we're working with a directed acyclic graph with one source node (that is, node with no incoming edges) in the original dataframe. These properties simplify matters quite a bit and guarantee that we can create a unique tree as shown in your desired output.
The flat dataframe isn't terribly easy to manipulate as a graph, so the first step I took was to turn it into an adjacency list.
Next, there are a few preparatory steps to perform:
Having built these structures, iterate over all possible paths in the graph to build the relabeled graph, formatting nodes with multiple incoming edges according to your spec (a reversed path node list) and storing the new names in the renaming dict.
Finally, sort a flattened version of this relabeled graph and dump it in the result dataframe.
Here's the code:
import pandas as pd
from collections import defaultdict
def find_source_nodes(graph):
sources = set(graph)
for neighbors in graph.values():
sources = sources - neighbors
return sources
def count_incoming_edges(graph):
counts = defaultdict(int)
for neighbors in graph.values():
for neighbor in neighbors:
counts[neighbor] += 1
return counts
def find_all_paths_in_dag(graph, src, path=[]):
path.append(src)
if src in graph and graph[src]:
for neighbor in graph[src]:
yield from find_all_paths_in_dag(graph, neighbor, path)
else:
yield path[:]
path.pop()
def flatten_adjacency_list(adj_list):
flat_graph = []
for parent, children in adj_list.items():
flat_graph.extend([(parent, child) for child in children])
return flat_graph
def relabel_dag(graph, root):
relabeled_graph = defaultdict(set)
all_paths = find_all_paths_in_dag(graph, root)
incoming_edge_counts = count_incoming_edges(graph)
renamed = {k: k for k in graph}
for path in all_paths:
for src, dst, i in zip(path, path[1:], range(len(path) - 1)):
if incoming_edge_counts[dst] > 1:
renamed[dst] = dst = f"{dst}_{''.join(path[:i+1][::-1])}"
relabeled_graph[renamed[src]].add(dst)
return relabeled_graph
if __name__ == "__main__":
df = pd.DataFrame(columns=["Parent", "Child"])
df["Parent"] = ["A", "A", "A", "B", "B", "B", "C", "C", "F", "G", "G"]
df["Child"] = ["B", "C", "E", "D", "E", "F", "F", "G", "H", "H", "I"]
graph = defaultdict(set)
for parent, child in zip(df["Parent"], df["Child"]):
graph[parent].add(child)
root = next(iter(find_source_nodes(graph)))
relabeled_tree = relabel_dag(graph, root)
flat_relabeled_tree = sorted(flatten_adjacency_list(relabeled_tree))
relabeled_df = pd.DataFrame(flat_relabeled_tree, columns=["Parent", "Child"])
print(relabeled_df)
Output:
Parent Child
0 A B
1 A C
2 A E_A
3 B D
4 B E_BA
5 B F_BA
6 C F_CA
7 C G
8 F_BA H_FBA
9 F_CA H_FCA
10 G H_GCA
11 G I