Search code examples
pythonpandasdataframedata-cleaning

Creating columns with hierarchal relationship to each other from two original columns


I am working on this project and I need to create columns with hierarchal relationship to each other in pandas based on the original two columns that indicates hierarchal relationships of each value.

supervisory_org Superior_org
0 org_2 org_1
1 org_7 org_3
2 org_4 org_2
3 org_6 org_3
4 org_9 org_5
5 org_3 org_1
6 org_5 org_3
7 org_8 org_5

Above are the two original columns that indicates the relationship between two organizations(values). I want to make those hierarchal relationship between orgs more visible by spreading across multiple columns as below. (Below is the desired output)

Level_1 Level_2 Level_3 Level_4
0 org_1 org_2 org_4 NaN
1 org_1 org_3 org_6 NaN
2 org_1 org_3 org_7 NaN
3 org_1 org_3 org_5 org_8
4 org_1 org_3 org_5 org_9

I'm trying to make this work in pandas but still haven't come up with a way to do it. Anyone have any suggestions on how to approach this problems?

This is the code for creating the first table

df = pd.DataFrame({"supervisory_org" : ["org_2","org_7" ,"org_4", "org_6","org_9","org_3","org_5", "org_8"],
                   "Superior_org" : ["org_1","org_3","org_2","org_3","org_5","org_1","org_3","org_5"]})

Solution

    1. Find the roots (those with no org over them)
    2. Use merge iteratively to match superior -> zero or more next rank orgs

    The usual merge behavior with matching one to many ensures we duplicate rows for superior orgs as appropriate when needed.

    We rename and remove columns as needed in the loop to handle the iterated merge.

    import pandas as pd
    
    df = pd.DataFrame({"supervisory_org" : ["org_2","org_7" ,"org_4", "org_6","org_9","org_3","org_5", "org_8"],
                       "Superior_org" : ["org_1","org_3","org_2","org_3","org_5","org_1","org_3","org_5"]})
    
    # Find the roots - orgs without superiors
    roots = (pd.Index(df['Superior_org'].unique()).difference(df['supervisory_org'])
             .to_series()
             .rename("Level_1")
            )
    roots
    
    org_1    org_1
    Name: Level_1, dtype: object
    
    # Use merge to find those ranked just below their superiors
    
    result = roots
    upper_col = "Superior_org"
    lower_col = 'supervisory_org'
    for index in range(1, 100):
        result = pd.merge(result, df,
                          left_on=f"Level_{index}",
                          right_on=upper_col,
                          how="left")
        result.pop(upper_col)
        # if there are no more lower levels, done
        if result[lower_col].isna().all():
            result.pop(lower_col)
            break
        result = result.rename(columns={lower_col: f'Level_{index + 1}'})
        
    result
    
      Level_1 Level_2 Level_3 Level_4
    0   org_1   org_2   org_4     NaN
    1   org_1   org_3   org_7     NaN
    2   org_1   org_3   org_6     NaN
    3   org_1   org_3   org_5   org_9
    4   org_1   org_3   org_5   org_8
    

    Sorting

    The sorting rule apparent from your example is first by depth, then alphabethical.

    So we can compute the depth, then sort, and remove the depth column.

    result.insert(0, 'depth', result.notna().sum(axis=1))
    
    result = (result.sort_values(by=list(result.columns), ignore_index=True)
              .drop(columns='depth')
             )
    result
    
      Level_1 Level_2 Level_3 Level_4
    0   org_1   org_2   org_4     NaN
    1   org_1   org_3   org_6     NaN
    2   org_1   org_3   org_7     NaN
    3   org_1   org_3   org_5   org_8
    4   org_1   org_3   org_5   org_9