Search code examples
pythonpandasreferencenetworkxgraph-theory

Is there a way to count the length of a daisy chain of internal table references using Pandas / Python?


We've a table that contains an Id, and on the same row, a reference to another Id in the same table. The Id record was infected by the referenced Id record. The referenced Id itself may or may not have a reference to another Id, it may not exist, or it may become a circular reference (linking back upon itself). Put into pandas, the problem looks a bit like this:

import pandas as pd
import numpy as np
# example data frame
inp = [{'Id': 1, 'refId': np.nan},
   {'Id': 2, 'refId': 1},
   {'Id': 3, 'refId': 2},
   {'Id': 4, 'refId': 3}, 
   {'Id': 5, 'refId': np.nan},
   {'Id': 6, 'refId': 7},
   {'Id': 7, 'refId': 20},
   {'Id': 8, 'refId': 9},
   {'Id': 9, 'refId': 8}, 
   {'Id': 10, 'refId': 8}
   ]
df = pd.DataFrame(inp)
print(df.dtypes)

What I am trying to do is count of how far back the references go for each row in the table. The logic would:

  • Starting with Result = 0 for each row:
  • If a Ref-Id is not nan, then add 1,
  • If the referenced-Id exists, and this referenced-Id has a reference, and the referenced-Id reference is not a back-reference, add 1 to the Result, then repeat this step until one of the conditions is NOT met, then go to Else;
  • Else (no reference-Id, no reference for the referenced-Id, or
    reference loops back to a previous reference), return the Result.

Results from example should look like:

Id  RefId  Result
1     -      0
2     1      1
3     2      2
4     3      3
5     -      0
6     7      2
7     20     1
8     9      1
9     8      1
10    8      2

Every approach I've tried ends up needed a new column for each reference to a reference, but the table is quite enourmous, and I'm not sure how long the daisy-chain of internal table references will ultimately be. I'm hoping there might be a better way, that isn't too difficult for me to learn.


Solution

  • This is a graph problem, so you can use networkx.

    Convert your dataframe to directed graph:

    import networkx as nx
    
    G = nx.from_pandas_edgelist(df.fillna(-1).astype(int),
                                source='Id', target='refId',   # source -> target
                                create_using=nx.DiGraph()      # directed graph
                                )
    
    # removing the NaN (replaced by "-1" for enabling indexing)
    G.remove_node(-1)
    

    This give this graph:

    graph

    Then simply count the children:

    nodes = {n: len(nx.descendants(G,n)) for n in G.nodes}
    
    df['Result'] = df['Id'].map(lambda x: nodes.get(x, 0))
    

    output:

       Id  refId  Result
    0   1    NaN       0
    1   2    1.0       1
    2   3    2.0       2
    3   4    3.0       3
    4   5    NaN       0
    5   6    7.0       2
    6   7   20.0       1
    7   8    9.0       1
    8   9    8.0       1
    9  10    8.0       2
    

    NB. the result is a bit different, so maybe I did not fully grasp your logic, but this gives you the general idea. Please elaborate on the logic.