I am trying to go through a hierarchy dataframe and record every possible routes into another dataframe. These routes can have variable depth.
Original dataframe (df). The highest column means that the value in the parent column is not a child of any:
parent | child | highest |
---|---|---|
a | b | 1 |
b | c | 0 |
b | d | 0 |
d | e | 0 |
End goal dataframe:
level 3 | level 2 | level 1 | level 0 |
---|---|---|---|
a | b | c | |
a | b | d | e |
This what I currently have
def search(parent):
for i in range(df.shape[0]):
if(df.iloc[i,0] == parent):
search(df.iloc[i,1])
for i in range(df.shape[0]):
if(df.iloc[i,2] == 1):
search(df.iloc[i,0])
I am able to go through the hierarchy but I do not know how to save it in the format I want.
You can use networkx
to solve the problem. Note if you use networkx
you don't need the highest
columns. The main function to find all paths is all_simple_paths
# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
# Create network from your dataframe
#G = nx.from_pandas_edgelist(df, source='parent', target='child',
# create_using=nx.DiGraph)
# For older versions of networkx
G = nx.DiGraph()
for _, (source, target) in df[['parent', 'child']].iterrows():
G.add_edge(source, target)
# Find roots of your graph (a root is a node with no input)
roots = [node for node, degree in G.in_degree() if degree == 0]
# Find leaves of your graph (a leaf is a node with no output)
leaves = [node for node, degree in G.out_degree() if degree == 0]
# Find all paths
paths = []
for root in roots:
for leaf in leaves:
for path in nx.all_simple_paths(G, root, leaf):
paths.append(path)
# Create a new dataframe
out = pd.DataFrame(paths).fillna('')
out.columns = reversed(out.add_prefix('level ').columns)
Output:
>>> out
level 3 level 2 level 1 level 0
0 a b c
1 a b d e