Search code examples
pythonpandasdataframenetworkx

Creating hierarchy using 4 columns in dataframe - pandas


Dataframe is below

    ID        ParentID   Filter Text
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

I need to add an additional column hierarchy like this:

    ID        ParentID   Filter Text   Hierarchy
0  98           97       NULL   AA
1  99            98      NULL   BB
2  100           99      NULL   CC
3  107           100     1      DD      DD/CC/BB/AA/LL
4  9999        1231     NULL   EE
5  10000        1334    NULL    FF
6  10001        850     2       GG      GG/HH/II/JJ/KK
7   850          230    NULL    HH
8   230          121    NULL    II
9   121          96     NULL    JJ
10 96            0      NULL    KK
11 97            0      NULL    LL

The rules I am looking at are below:

  1. Only populate hierarchy column for rows which have filter value populated, the rest of the rows don't need hierarchy done.

  2. When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.

  3. Trying to do this with itertools but the looping is taking too long as the original dataset is huge

4)Recordset size is ~ 200k

The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?


Solution

  • Here is a solution involving one pass of dfs for each root node. The worst time complexity is O(V + FV) where V is the number of columns and F is the number of columns to populate hierarchy for. This might be faster than other solutions as it exploits the fact that the given graph is a tree and hence there is only one path from root to any node.

    # this is a recursive dfs code with additional logic to store the hierarchy
    # of interesting nodes
    def dfs(graph, stack, interesting, hierarchy):
        node = stack[-1]
        for child in graph[node]:
            stack.append(child)
            if child in interesting:
                hierarchy[child] = stack[:]
            dfs(graph, stack, interesting, hierarchy)
        stack.pop()
    
    
    # make 'ID' the index
    df = df.set_index("ID")
    
    # find the roots
    roots = df[df["ParentID"] == 0].index
    # interesting nodes to find the hierarchy for
    interesting = set(df[df["Filter"].notna()].index)
    # dict to store the id -> hierarchy mapping
    hierarchy = {}
    
    # build a directed graph in adjacency list of parent -> children
    graph = defaultdict(list)
    for node, parent in zip(df.index, df["ParentID"]):
        graph[parent].append(node)
    
    # run dfs on each root
    for root in roots:
        stack = [root]
        dfs(graph, stack, interesting, hierarchy)
    
    # populate the hierarchy column
    df["Hierarchy"] = ""
    for node, path in hierarchy.items():
        df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
    
    # make 'ID' a column again
    df = df.reset_index()
    
    # now we're done!
    print(df)
    

    Full code is in https://pastebin.com/6MFaqZQw.