I have a dataset that looks like:
Child Parent Ownership Parent_Country
A B 0.1 Foreign
A C 0.3 Domestic
A D 0.6 Domestic
B E 0.4 Domestic
B F 0.6 Foreign
H J 1 Foreign
C G 0.9 Foreign
A G 0.05 Foreign
D H 0.8 Foreign
D I 0.2 Domestic
Note that the data structure is a tree/DAG structure where A->B->E, A->B->F, etc. A is the ultimate Child owned by multiple parents who are owned by multiple parents.
What I want to do is to output just the Parent(s) that meets the following criteria: 1. Owns 30% of A or more directly and/or indirectly 2. Is foreign. 3. This foreign parent needs to be the first level foreign parent that meets the ownership criteria, and any foreign parent that owns that foreign parent and still meets the ownership criteria does not count
Desired output:
{'A':[('G',0.33),('H',0.48)]}
Note that in this case the parents that meet this criteria is G and H, because H is foreign and owns 80% of D who owns 60% of A (0.80.6=0.48), and G is foreign and owns 90% of C who owns 30% of A, plus G owns 5% of A directly (0.90.3+0.05=0.33). Also note, that even though J owns 100% of H so should also own 48% of A, but because H is closer to A, J does not meet the criteria.
Based on a similar question I asked, I understand that my dataframe represents a weighted directed acyclic graph, where Ultimate Child A is the root. I first created a dict appending all the parents, ownership, and country to each child, but then I'm a bit stuck. I tried using dfs to calculate all the ultimate ownership percentages in A for every parent, and output all the parents of A that fit the criteria. However, I met the following issues:
See below the code I used:
import pandas as pd
from collections import defaultdict
df=pd.DataFrame({'Child': list('AAABBHCADD'), 'Parent': list('BCDEFJGGHI'), 'Ownership': [.1,.3,.6,.4,.6,1,.9,.05,.8,.2],'Parent_Country':('Foreign','Domestic','Domestic','Domestic','Foreign','Foreign','Foreign','Foreign','Foreign','Domestic')})
#Create a dict appending all the parents to each child
dag_as_dict = defaultdict(list) # {child: list of (parent, weight)}
for idx, row in df.iterrows():
dag_as_dict[row['Child']].append((row['Parent'], row['Ownership'],row['Parent_Country']))
#Calculate ultimate ownership in A for every parent
def get_weights_of_descendants(dag_as_dict, root, root_weight=1.0, country='Domestic', result=None,real_parent=None):
if result is None:
result = defaultdict(float)
if real_parent is None:
real_parent=defaultdict(str)
for parent, weight, country in dag_as_dict[root]:
new_weight=weight*root_weight
result[parent] += new_weight
if result[parent]>0.3 and country!='Domestic':
real_parent[root]=parent
break
get_weights_of_descendants(dag_as_dict, parent, new_weight, country,result,real_parent)
return real_parent
# return result
I would use networkx
.
First let's see how to get all ownership without filtering:
import networkx as nx
import numpy as np
G = nx.from_pandas_edgelist(df.rename(columns={'Ownership': 'weight'}),
source='Parent', target='Child',
create_using=nx.DiGraph, edge_attr='weight')
foreign = set(df.loc[df['Parent_Country'].eq('Foreign'), 'Parent'])
# {'B', 'F', 'G', 'H', 'J'}
paths = {f: sum(np.product([G.get_edge_data(*e)['weight'] for e in p])
for p in nx.all_simple_edge_paths(G, f, 'A'))
for f in foreign
}
Output:
{'J': 0.48, 'G': 0.32, 'B': 0.1, 'F': 0.06, 'H': 0.48}
Now if you want to only keep those ≥ 30% and not having a foreign parent, you can add a step to remove the parents that own a foreign company and filter based on a threshold:
import networkx as nx
import numpy as np
G = nx.from_pandas_edgelist(df.rename(columns={'Ownership': 'weight'}),
source='Parent', target='Child',
create_using=nx.DiGraph, edge_attr='weight')
foreign = set(df.loc[df['Parent_Country'].eq('Foreign'), 'Parent'])
# {'B', 'F', 'G', 'H', 'J'}
paths = {f: total
for f in foreign
if (total:=
sum(np.product([G.get_edge_data(*e)['weight'] for e in p])
for p in nx.all_simple_edge_paths(G, f, 'A'))
) >= 0.3
}
# remove foreign parents of a foreign company:
for n in list(paths):
for s in G.predecessors(n):
if s in paths:
del paths[s]
Output:
{'G': 0.32, 'H': 0.48}
Graph (red nodes are foreign, edge width proportional to the % ownership):
input:
df = pd.DataFrame({'Child': ['A', 'A', 'A', 'B', 'B', 'H', 'C', 'A', 'D', 'D'],
'Parent': ['B', 'C', 'D', 'E', 'F', 'J', 'G', 'G', 'H', 'I'],
'Ownership': [0.1, 0.3, 0.6, 0.4, 0.6, 1.0, 0.9, 0.05, 0.8, 0.2],
'Parent_Country': ['Foreign', 'Domestic', 'Domestic', 'Domestic', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Domestic']})
This requires graphviz
.
import pandas as pd
import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout, to_agraph
df = pd.DataFrame({'Child': ['A', 'A', 'A', 'B', 'B', 'H', 'C', 'A', 'D', 'D'],
'Parent': ['B', 'C', 'D', 'E', 'F', 'J', 'G', 'G', 'H', 'I'],
'Ownership': [0.1, 0.3, 0.6, 0.4, 0.6, 1.0, 0.9, 0.05, 0.8, 0.2],
'Parent_Country': ['Foreign', 'Domestic', 'Domestic', 'Domestic', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Domestic']})
foreign = set(df.loc[df['Parent_Country'].eq('Foreign'), 'Parent'])
# {'B', 'F', 'G', 'H', 'J'}
G = nx.from_pandas_edgelist(df.rename(columns={'Ownership': 'penwidth'})
.assign(label=lambda d: d['penwidth'],
penwidth=lambda d: d['penwidth'].mul(5),
),
source='Parent', target='Child',
create_using=nx.DiGraph,
edge_attr=['penwidth', 'label'])
nx.set_node_attributes(G, {n: {'color': 'red'} for n in foreign})
A = to_agraph(G)
A.layout('dot')
A.draw('graph.png')