I'm currently working through a problem with a dataset (stored as a dataframe) that I'm not sure how to solve. One of the tables in the dataset shows all relationships for a given ID, but there are multiple instances where a child of a parent has relationships of its own, the grandchild has its own set, and so on... What I want is a single table that shows all possible matches for a given ID.
Example of Original Table
Parent | Child |
---|---|
1 | 2 |
1 | 3 |
2 | 1 |
2 | 4 |
3 | 2 |
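For reference, something like this reproduces the table above as a dataframe (pandas assumed):

import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    "Parent": [1, 1, 2, 2, 3],
    "Child": [2, 3, 1, 4, 2],
})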
Final Result
Parent | Child |
---|---|
1 | 2 |
1 | 3 |
1 | 4 |
2 | 1 |
2 | 3 |
2 | 4 |
3 | 1 |
3 | 2 |
3 | 4 |
Is this the right way to structure the data? The true dataset has ~15k rows, so I expect it to grow a lot larger after this operation. I've looked into tree structures, but I'm not 100% sure that's what's needed.
Cheers
With df being your dataframe, you could try the following:
import pandas as pd

def descendants(childs, source):
    """Return every node reachable from source via the child mapping (iterative DFS)."""
    descendants = set()
    stack = [source]
    while stack:
        descendant = stack.pop()
        if descendant not in childs:
            # Node has no children of its own
            continue
        for child in childs[descendant]:
            if child != source and child not in descendants:
                stack.append(child)
                descendants.add(child)
    return descendants

# Map each parent to the set of its direct children
childs = {parent: set(childs) for parent, childs in df.groupby("Parent")["Child"]}

df_descendants = pd.DataFrame(
    [(parent, descendants(childs, parent)) for parent in df["Parent"].unique()],
    columns=["Parent", "Descendant"]
).explode("Descendant", ignore_index=True).sort_values(["Parent", "Descendant"])
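For the sample table this makes childs equal to {1: {2, 3}, 2: {1, 4}, 3: {2}}, i.e. each parent mapped to the set of its direct children, and the resulting dataframe matches the expected result shown below.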
Or you could just use networkx:
import networkx as nx

G = nx.from_pandas_edgelist(df, source="Parent", target="Child")

df_descendants = pd.DataFrame(
    [(parent, nx.descendants(G, parent)) for parent in df["Parent"].unique()],
    columns=["Parent", "Descendant"]
).explode("Descendant", ignore_index=True).sort_values(["Parent", "Descendant"])
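Note that from_pandas_edgelist builds an undirected graph by default, which is exactly what produces the symmetric result you asked for. If you ever need strictly one-way descendants instead, you can request a directed graph via create_using (a small sketch, same columns assumed):

# Treat the relationships as one-way (parent -> child only);
# nx.descendants will then only follow edge direction.
G_directed = nx.from_pandas_edgelist(
    df, source="Parent", target="Child", create_using=nx.DiGraph
)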
Result for the sample (both alternatives):
Parent Descendant
0 1 2
1 1 3
2 1 4
3 2 1
4 2 3
5 2 4
6 3 1
7 3 2
8 3 4
If you want to include the distance of the descendants from their ancestors, then you have to switch from a DFS to a BFS (by using a deque with .appendleft for the stack, which effectively turns it into a FIFO queue):
from collections import deque

def descendants(childs, source):
    """Return {node: distance} for every node reachable from source (iterative BFS)."""
    descendants = {}
    stack = deque([(source, 0)])
    while stack:
        descendant, distance = stack.pop()
        distance += 1
        if descendant not in childs:
            continue
        for child in childs[descendant]:
            if child != source and child not in descendants:
                stack.appendleft((child, distance))
                descendants[child] = distance
    return descendants
childs = {
    parent: set(childs) for parent, childs in df.groupby("Parent")["Child"]
}

df_descendants = pd.DataFrame(
    [
        (parent, descendant, distance)
        for parent in df["Parent"].unique()
        for descendant, distance in descendants(childs, parent).items()
    ],
    columns=["Parent", "Descendant", "Distance"]
).sort_values(["Parent", "Distance", "Descendant"])
Result for the sample:
Parent Descendant Distance
0 1 2 1
1 1 3 1
2 1 4 2
3 2 1 1
4 2 4 1
5 2 3 2
6 3 2 1
7 3 1 2
8 3 4 2
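If you go the networkx route, a similar distance table can be built with single_source_shortest_path_length, which returns a {node: distance} mapping per source (a sketch, assuming the undirected G and df from above):

# Sketch: distances via networkx instead of the hand-rolled BFS.
# The source itself comes back at distance 0, so it is filtered out.
df_distances = pd.DataFrame(
    [
        (parent, descendant, distance)
        for parent in df["Parent"].unique()
        for descendant, distance in nx.single_source_shortest_path_length(G, parent).items()
        if descendant != parent
    ],
    columns=["Parent", "Descendant", "Distance"]
).sort_values(["Parent", "Distance", "Descendant"])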