Dataframe is below
ID ParentID Filter Text
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
I need to add an additional column hierarchy like this:
ID ParentID Filter Text Hierarchy
0 98 97 NULL AA
1 99 98 NULL BB
2 100 99 NULL CC
3 107 100 1 DD DD/CC/BB/AA/LL
4 9999 1231 NULL EE
5 10000 1334 NULL FF
6 10001 850 2 GG GG/HH/II/JJ/KK
7 850 230 NULL HH
8 230 121 NULL II
9 121 96 NULL JJ
10 96 0 NULL KK
11 97 0 NULL LL
The rules I am looking at are below:
Only populate hierarchy column for rows which have filter value populated, the rest of the rows don't need hierarchy done.
When a row is found having filter value not null, lookup its parentID, then search this parentid in ID column. When found reclusively keep going up till, parent id is 0.
Trying to do this with itertools but the looping is taking too long as the original dataset is huge
4)Recordset size is ~ 200k
The below solution provided kindly by mozway seems to work but for a recorset of 200k records, it takes a lot of time. Is there a tweak that can be done to this to get to the solution faster ?
Here is a solution involving one pass of dfs for each root node.
The worst time complexity is O(V + FV)
where V
is the number of columns and F
is
the number of columns to populate hierarchy for.
This might be faster than other solutions as it exploits the fact that the
given graph is a tree and hence there is only one path from root to any node.
# this is a recursive dfs code with additional logic to store the hierarchy
# of interesting nodes
def dfs(graph, stack, interesting, hierarchy):
node = stack[-1]
for child in graph[node]:
stack.append(child)
if child in interesting:
hierarchy[child] = stack[:]
dfs(graph, stack, interesting, hierarchy)
stack.pop()
# make 'ID' the index
df = df.set_index("ID")
# find the roots
roots = df[df["ParentID"] == 0].index
# interesting nodes to find the hierarchy for
interesting = set(df[df["Filter"].notna()].index)
# dict to store the id -> hierarchy mapping
hierarchy = {}
# build a directed graph in adjacency list of parent -> children
graph = defaultdict(list)
for node, parent in zip(df.index, df["ParentID"]):
graph[parent].append(node)
# run dfs on each root
for root in roots:
stack = [root]
dfs(graph, stack, interesting, hierarchy)
# populate the hierarchy column
df["Hierarchy"] = ""
for node, path in hierarchy.items():
df.loc[node, "Hierarchy"] = "/".join(df.loc[path, "Text"])
# make 'ID' a column again
df = df.reset_index()
# now we're done!
print(df)
Full code is in https://pastebin.com/6MFaqZQw.