I am working on this project and I need to create columns with hierarchal relationship to each other in pandas based on the original two columns that indicates hierarchal relationships of each value.
supervisory_org | Superior_org | |
---|---|---|
0 | org_2 | org_1 |
1 | org_7 | org_3 |
2 | org_4 | org_2 |
3 | org_6 | org_3 |
4 | org_9 | org_5 |
5 | org_3 | org_1 |
6 | org_5 | org_3 |
7 | org_8 | org_5 |
Above are the two original columns that indicates the relationship between two organizations(values). I want to make those hierarchal relationship between orgs more visible by spreading across multiple columns as below. (Below is the desired output)
Level_1 | Level_2 | Level_3 | Level_4 | |
---|---|---|---|---|
0 | org_1 | org_2 | org_4 | NaN |
1 | org_1 | org_3 | org_6 | NaN |
2 | org_1 | org_3 | org_7 | NaN |
3 | org_1 | org_3 | org_5 | org_8 |
4 | org_1 | org_3 | org_5 | org_9 |
I'm trying to make this work in pandas but still haven't come up with a way to do it. Anyone have any suggestions on how to approach this problems?
This is the code for creating the first table
df = pd.DataFrame({"supervisory_org" : ["org_2","org_7" ,"org_4", "org_6","org_9","org_3","org_5", "org_8"],
"Superior_org" : ["org_1","org_3","org_2","org_3","org_5","org_1","org_3","org_5"]})
The usual merge behavior with matching one to many ensures we duplicate rows for superior orgs as appropriate when needed.
We rename and remove columns as needed in the loop to handle the iterated merge.
import pandas as pd
df = pd.DataFrame({"supervisory_org" : ["org_2","org_7" ,"org_4", "org_6","org_9","org_3","org_5", "org_8"],
"Superior_org" : ["org_1","org_3","org_2","org_3","org_5","org_1","org_3","org_5"]})
# Find the roots - orgs without superiors
roots = (pd.Index(df['Superior_org'].unique()).difference(df['supervisory_org'])
.to_series()
.rename("Level_1")
)
roots
org_1 org_1
Name: Level_1, dtype: object
# Use merge to find those ranked just below their superiors
result = roots
upper_col = "Superior_org"
lower_col = 'supervisory_org'
for index in range(1, 100):
result = pd.merge(result, df,
left_on=f"Level_{index}",
right_on=upper_col,
how="left")
result.pop(upper_col)
# if there are no more lower levels, done
if result[lower_col].isna().all():
result.pop(lower_col)
break
result = result.rename(columns={lower_col: f'Level_{index + 1}'})
result
Level_1 Level_2 Level_3 Level_4
0 org_1 org_2 org_4 NaN
1 org_1 org_3 org_7 NaN
2 org_1 org_3 org_6 NaN
3 org_1 org_3 org_5 org_9
4 org_1 org_3 org_5 org_8
The sorting rule apparent from your example is first by depth, then alphabethical.
So we can compute the depth, then sort, and remove the depth column.
result.insert(0, 'depth', result.notna().sum(axis=1))
result = (result.sort_values(by=list(result.columns), ignore_index=True)
.drop(columns='depth')
)
result
Level_1 Level_2 Level_3 Level_4
0 org_1 org_2 org_4 NaN
1 org_1 org_3 org_6 NaN
2 org_1 org_3 org_7 NaN
3 org_1 org_3 org_5 org_8
4 org_1 org_3 org_5 org_9