Search code examples
pythonfor-looprecursiondepth-first-searchdirected-acyclic-graphs

Given the root, how to find the ends of a tree/directed acyclic graph based on condition in Python


Details of my problem:

I have a dataset that looks like:

Child Parent Ownership  Parent_Country
  A     B      0.1      Foreign
  A     C      0.3      Domestic
  A     D      0.6      Domestic
  B     E      0.4      Domestic
  B     F      0.6      Foreign
  H     J       1       Foreign
  C     G      0.9      Foreign
  A     G      0.05     Foreign
  D     H      0.8      Foreign
  D     I      0.2      Domestic

Note that the data structure is a tree/DAG structure where A->B->E, A->B->F, etc. A is the ultimate Child owned by multiple parents who are owned by multiple parents.

What I want to do is to output just the Parent(s) that meets the following criteria: 1. Owns 30% of A or more directly and/or indirectly 2. Is foreign. 3. This foreign parent needs to be the first level foreign parent that meets the ownership criteria, and any foreign parent that owns that foreign parent and still meets the ownership criteria does not count

Desired output:

{'A':[('G',0.33),('H',0.48)]}

Note that in this case the parents that meet this criteria is G and H, because H is foreign and owns 80% of D who owns 60% of A (0.80.6=0.48), and G is foreign and owns 90% of C who owns 30% of A, plus G owns 5% of A directly (0.90.3+0.05=0.33). Also note, that even though J owns 100% of H so should also own 48% of A, but because H is closer to A, J does not meet the criteria.

What I have tried:

Based on a similar question I asked, I understand that my dataframe represents a weighted directed acyclic graph, where Ultimate Child A is the root. I first created a dict appending all the parents, ownership, and country to each child, but then I'm a bit stuck. I tried using dfs to calculate all the ultimate ownership percentages in A for every parent, and output all the parents of A that fit the criteria. However, I met the following issues:

  1. I can't keep only the root A as the key in the output dictionary and append multiple rows containing parents & ownership pairs to A. I tried initializing the key outside of the for loop but got an error
  2. I want the loop to break when it hits the first parent that meets the criteria for that path, but for other paths to keep going.
  3. Maybe I just need to use a totally different method getting to the end result. Please heelpp

See below the code I used:

import pandas as pd
from collections import defaultdict

df=pd.DataFrame({'Child': list('AAABBHCADD'), 'Parent': list('BCDEFJGGHI'), 'Ownership': [.1,.3,.6,.4,.6,1,.9,.05,.8,.2],'Parent_Country':('Foreign','Domestic','Domestic','Domestic','Foreign','Foreign','Foreign','Foreign','Foreign','Domestic')})

#Create a dict appending all the parents to each child
dag_as_dict = defaultdict(list) # {child: list of (parent, weight)}
for idx, row in df.iterrows():
    dag_as_dict[row['Child']].append((row['Parent'], row['Ownership'],row['Parent_Country']))

#Calculate ultimate ownership in A for every parent
def get_weights_of_descendants(dag_as_dict, root, root_weight=1.0, country='Domestic', result=None,real_parent=None):
    if result is None:
        result = defaultdict(float)
    if real_parent is None:
        real_parent=defaultdict(str)
    for parent, weight, country in dag_as_dict[root]:
        new_weight=weight*root_weight
        result[parent] += new_weight
        if result[parent]>0.3 and country!='Domestic':
            real_parent[root]=parent
            break
        get_weights_of_descendants(dag_as_dict, parent, new_weight, country,result,real_parent)            
    return real_parent
#    return result

Solution

  • I would use networkx.

    First let's see how to get all ownership without filtering:

    import networkx as nx
    import numpy as np
    
    G = nx.from_pandas_edgelist(df.rename(columns={'Ownership': 'weight'}),
                                source='Parent', target='Child',
                                create_using=nx.DiGraph, edge_attr='weight')
    
    
    foreign = set(df.loc[df['Parent_Country'].eq('Foreign'), 'Parent'])
    # {'B', 'F', 'G', 'H', 'J'}
    
    paths = {f: sum(np.product([G.get_edge_data(*e)['weight'] for e in p])
                     for p in nx.all_simple_edge_paths(G, f, 'A'))
             for f in foreign
            }
    

    Output:

    {'J': 0.48, 'G': 0.32, 'B': 0.1, 'F': 0.06, 'H': 0.48}
    

    Now if you want to only keep those ≥ 30% and not having a foreign parent, you can add a step to remove the parents that own a foreign company and filter based on a threshold:

    import networkx as nx
    import numpy as np
    
    G = nx.from_pandas_edgelist(df.rename(columns={'Ownership': 'weight'}),
                                source='Parent', target='Child',
                                create_using=nx.DiGraph, edge_attr='weight')
    
    
    foreign = set(df.loc[df['Parent_Country'].eq('Foreign'), 'Parent'])
    # {'B', 'F', 'G', 'H', 'J'}
    
    paths = {f: total
             for f in foreign
             if (total:=
                 sum(np.product([G.get_edge_data(*e)['weight'] for e in p])
                     for p in nx.all_simple_edge_paths(G, f, 'A'))
                ) >= 0.3
            }
    
    # remove foreign parents of a foreign company:
    for n in list(paths):
        for s in G.predecessors(n):
            if s in paths:
                del paths[s]
    

    Output:

    {'G': 0.32, 'H': 0.48}
    

    Graph (red nodes are foreign, edge width proportional to the % ownership):

    graph of companies tree

    input:

    df = pd.DataFrame({'Child': ['A', 'A', 'A', 'B', 'B', 'H', 'C', 'A', 'D', 'D'],
                       'Parent': ['B', 'C', 'D', 'E', 'F', 'J', 'G', 'G', 'H', 'I'],
                       'Ownership': [0.1, 0.3, 0.6, 0.4, 0.6, 1.0, 0.9, 0.05, 0.8, 0.2],
                       'Parent_Country': ['Foreign', 'Domestic', 'Domestic', 'Domestic', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Domestic']})
    
    code to plot the graph

    This requires graphviz.

    import pandas as pd
    import networkx as nx
    from networkx.drawing.nx_agraph import graphviz_layout, to_agraph
    
    df = pd.DataFrame({'Child': ['A', 'A', 'A', 'B', 'B', 'H', 'C', 'A', 'D', 'D'],
                       'Parent': ['B', 'C', 'D', 'E', 'F', 'J', 'G', 'G', 'H', 'I'],
                       'Ownership': [0.1, 0.3, 0.6, 0.4, 0.6, 1.0, 0.9, 0.05, 0.8, 0.2],
                       'Parent_Country': ['Foreign', 'Domestic', 'Domestic', 'Domestic', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Domestic']})
    
    foreign = set(df.loc[df['Parent_Country'].eq('Foreign'), 'Parent'])
    # {'B', 'F', 'G', 'H', 'J'}
    
    G = nx.from_pandas_edgelist(df.rename(columns={'Ownership': 'penwidth'})
                                  .assign(label=lambda d: d['penwidth'],
                                          penwidth=lambda d: d['penwidth'].mul(5),
                                         ),
                                source='Parent', target='Child',
                                create_using=nx.DiGraph,
                                edge_attr=['penwidth', 'label'])
    
    nx.set_node_attributes(G, {n: {'color': 'red'} for n in foreign})
    
    A = to_agraph(G)
    A.layout('dot')
    A.draw('graph.png')