python, pandas, pyspark, hierarchy

Finding ultimate parent


I am trying to find the ultimate parent with pandas. The task has one peculiarity that a graph approach doesn't seem to fit, or I simply don't know how to use it correctly. Input:

Child Parent Class
1001 8888 A
1001 1002 D
1001 1002 C
1001 1003 C
1003 6666 G
1002 9999 H

Output:

Child Ultimate_Parent Class Connection
1001 8888 A Direct
1001 9999 D Indirect
1001 9999 C Indirect
1001 6666 C Indirect
1003 6666 G Direct
1002 9999 H Direct

This is what I tried:

import pandas as pd
import networkx as nx

# Same rows as the input table above
df = pd.DataFrame({'Child': ['1001', '1001', '1001', '1001', '1003', '1002'],
                   'Parent': ['8888', '1002', '1002', '1003', '6666', '9999'],
                   'Class': ['A', 'D', 'C', 'C', 'G', 'H']})

def get_hierarchy(df):
    # Edges point from Child to Parent; the Class column is not used here
    DiG = nx.from_pandas_edgelist(df, 'Child', 'Parent', create_using=nx.DiGraph())
    return pd.DataFrame.from_records(
        [(n1, n2) for n1 in DiG.nodes() for n2 in nx.ancestors(DiG, n1)],
        columns=['child', 'Ultimate_parent'])

# df = df.toPandas()  # only needed when df starts out as a Spark DataFrame
df = get_hierarchy(df)

I can't figure out how to carry the Class attribute through, so that 1001 appears twice with parent 9999, once with class D and once with class C.


Solution

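  • Build a directed graph with edges from Parent to Child. Nodes with in-degree 0 are the roots. If a row's Parent is a root, it is already the ultimate parent and the connection is Direct; otherwise use G.predecessors to look up the Parent's own parent and mark the connection Indirect. Because the new columns are added to the original rows, Class is kept as-is.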

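    import networkx as nx
    import numpy as np

    # df is the DataFrame from the question (columns Child, Parent, Class)
    G = nx.from_pandas_edgelist(df, source='Parent', target='Child',
                                create_using=nx.DiGraph)
    
    # Roots are nodes that never appear as a Child (no incoming edge)
    roots = [node for node, degree in G.in_degree() if degree == 0]
    
    # Keep a root Parent as-is, otherwise step up to its first predecessor
    ultimate_parent = [node if node in roots else list(G.predecessors(node))[0] 
                           for node in df['Parent']]
    
    df['Ultimate_Parent'] = ultimate_parent
    # Direct when the row's Parent is itself the ultimate parent
    df['Connection'] = np.where(df['Parent'] == df['Ultimate_Parent'],
                                'Direct', 'Indirect')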
    

    Output:

    >>> df
       Child  Parent Class  Ultimate_Parent Connection
    0   1001    8888     A             8888     Direct
    1   1001    1002     D             9999   Indirect
    2   1001    1002     C             9999   Indirect
    3   1001    1003     C             6666   Indirect
    4   1003    6666     G             6666     Direct
    5   1002    9999     H             9999     Direct
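
The snippet above moves up only one level from each Parent, which is enough for the sample data. For deeper hierarchies, a possible variation is to keep following predecessors until a root is reached. The sketch below assumes that every node on the path from a row's Parent to the root has a single parent (if it has several, only the first one is followed); `climb_to_root` is a hypothetical helper name, not part of networkx.

```python
import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame({'Child': ['1001', '1001', '1001', '1001', '1003', '1002'],
                   'Parent': ['8888', '1002', '1002', '1003', '6666', '9999'],
                   'Class': ['A', 'D', 'C', 'C', 'G', 'H']})

# Edges point from Parent to Child, as in the answer above
G = nx.from_pandas_edgelist(df, source='Parent', target='Child',
                            create_using=nx.DiGraph)

def climb_to_root(G, node):
    """Hypothetical helper: follow the first predecessor until none is left."""
    seen = {node}
    while True:
        preds = list(G.predecessors(node))
        if not preds:
            return node          # no incoming edge: this is a root
        node = preds[0]          # assumption: only the first parent matters
        if node in seen:
            raise ValueError('cycle detected in the hierarchy')
        seen.add(node)

# One lookup per row, so rows that differ only in Class are all kept
df['Ultimate_Parent'] = [climb_to_root(G, parent) for parent in df['Parent']]
df['Connection'] = np.where(df['Parent'] == df['Ultimate_Parent'],
                            'Direct', 'Indirect')
```

On the sample data this should give the same Ultimate_Parent and Connection columns as the output shown above; the difference only appears when a chain is longer than two levels.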