Search code examples
pythonpandasdataframehierarchy

create hierarchy using two columns in pandas


Data I am working with is below:

Name RefSecondary     RefMain
test  2               3   
bet   3               4   
get   1               2   
set   null            1   
net   3               5

I have done a very simple query which looks up the presence of values in dataframe and build hierarchy

sys_role = 'sample.xlsx'
df = pd.read_excel(sys_role,na_filter = False).apply(lambda x: x.astype(str).str.strip())
for i in range(count):
    for j in range(count):
        if df.iloc[i]['RefMain'] == df.iloc[j]['RefSecondary']:
            df.iloc[j, df.columns.get_loc('Name')] = "/".join([df.iloc[i]['Name'],df.iloc[j]['Name']])
    j = j+1
i = i+1

The results I am getting is below:

   Result          RefMain
0  get/test           3
1  test/bet           4
2  set/get            2
3  set                1
4  test/net           5

This is really slow and the logic doesn't work perfectly as well. Is there a way I can get this done faster?

Logic needs to be as below:

 1)Take a value from column RefMain,and find its correspoding RefSecondary value.  
 2)Look up the RefSecondary value  in RefMain, 
 3)If found Back to Step 1 and repeat.
 4)This continues recursively till no value/null is found in RefSecondary column.

Resultant dataframe should look like below:

   Result            RefMain
0  set/get/test          3
1  set/get/test/bet      4
2  set/get               2
3  set                   1
4  set/get/test/net      5

Solution

  • This sounds like a graph problem. You can try networkx as follows:

    df = df.fillna(-1)
    
    # create a graph
    G = nx.DiGraph()
    
    # add reference as edges
    G.add_edges_from(zip(df['RefMain'],df['RefSecondary'] ))
    
    # rename the nodes accordingly
    G = nx.relabel_nodes(G, mapping=df.set_index('RefMain')['Name'].to_dict())
    
    
    # merge the path list to the dataframe
    df = df.merge(pd.DataFrame(nx.shortest_path(G)).T['null'], 
                  left_on='Name', 
                  right_index=True)
    
    # new column:
    df['Path'] = df['null'].apply(lambda x: '/'.join(x[-2::-1]) )
    

    Output:

       Name RefSecondary RefMain                         null              Path
    0  test            2       3       [test, get, set, null]      set/get/test
    1   bet            3       4  [bet, test, get, set, null]  set/get/test/bet
    2   get            1       2             [get, set, null]           set/get
    3   set         null       1                  [set, null]               set
    4   net            3       5  [net, test, get, set, null]  set/get/test/net