We've a table that contains an Id, and on the same row, a reference to another Id in the same table. The Id record was infected by the referenced Id record. The referenced Id itself may or may not have a reference to another Id, it may not exist, or it may become a circular reference (linking back upon itself). Put into pandas, the problem looks a bit like this:
import pandas as pd
import numpy as np
# example data frame
inp = [{'Id': 1, 'refId': np.nan},
{'Id': 2, 'refId': 1},
{'Id': 3, 'refId': 2},
{'Id': 4, 'refId': 3},
{'Id': 5, 'refId': np.nan},
{'Id': 6, 'refId': 7},
{'Id': 7, 'refId': 20},
{'Id': 8, 'refId': 9},
{'Id': 9, 'refId': 8},
{'Id': 10, 'refId': 8}
]
df = pd.DataFrame(inp)
print(df.dtypes)
What I am trying to do is count of how far back the references go for each row in the table. The logic would:
Results from example should look like:
Id RefId Result
1 - 0
2 1 1
3 2 2
4 3 3
5 - 0
6 7 2
7 20 1
8 9 1
9 8 1
10 8 2
Every approach I've tried ends up needed a new column for each reference to a reference, but the table is quite enourmous, and I'm not sure how long the daisy-chain of internal table references will ultimately be. I'm hoping there might be a better way, that isn't too difficult for me to learn.
This is a graph problem, so you can use networkx
.
Convert your dataframe to directed graph:
import networkx as nx
G = nx.from_pandas_edgelist(df.fillna(-1).astype(int),
source='Id', target='refId', # source -> target
create_using=nx.DiGraph() # directed graph
)
# removing the NaN (replaced by "-1" for enabling indexing)
G.remove_node(-1)
This give this graph:
Then simply count the children:
nodes = {n: len(nx.descendants(G,n)) for n in G.nodes}
df['Result'] = df['Id'].map(lambda x: nodes.get(x, 0))
output:
Id refId Result
0 1 NaN 0
1 2 1.0 1
2 3 2.0 2
3 4 3.0 3
4 5 NaN 0
5 6 7.0 2
6 7 20.0 1
7 8 9.0 1
8 9 8.0 1
9 10 8.0 2
NB. the result is a bit different, so maybe I did not fully grasp your logic, but this gives you the general idea. Please elaborate on the logic.