Search code examples
pythontree

Is there a way in Python to identify all possible relationships back up to a single parent


I'm currently working through a problem within the dataset (stored as a dataframe) I have that I'm not sure how to solve for. One of the tables in the dataset shows all relationships for a given ID. But there are multiple instances where a child of a parent will have other relationships as well as the grandchild having its own set and so on... What I want to achieve is a single table that shows all possible matches for a given ID.

Example of Original Table

Parent Child
1 2
1 3
2 1
2 4
3 2

Final Result

Parent Child
1 2
1 3
1 4
2 1
2 3
2 4
3 1
3 2
3 4

Is this the right way to structure the data? the true dataset has 15k rows so I think it will be a lot larger after this operation.

Cheers

I've looked into tree structures but not 100% sure if that's what's needed.


Solution

  • With df your dataframe you could try the following:

    def descendants(childs, source):
        descendants = set()
        stack = [source]
        while stack:
            descendant = stack.pop()
            if descendant not in childs:
                continue
            for child in childs[descendant]:
                if child != source and child not in descendants:
                    stack.append(child)
                    descendants.add(child)
        return descendants
    
    childs = {parent: set(childs) for parent, childs in df.groupby("Parent")["Child"]}
    
    df_descendants = pd.DataFrame(
        [(parent, descendants(childs, parent)) for parent in df["Parent"].unique()],
        columns=["Parent", "Descendant"]
    ).explode("Descendant", ignore_index=True).sort_values(["Parent", "Descendant"])
    

    Or you could just use networkx:

    import networkx as nx
    
    G = nx.from_pandas_edgelist(df, source="Parent", target="Child")
    df_descendants = pd.DataFrame(
        [(parent, nx.descendants(G, parent)) for parent in df["Parent"].unique()],
        columns=["Parent", "Descendant"]
    ).explode("Descendant", ignore_index=True).sort_values(["Parent", "Descendant"])
    

    Result for the sample (both alternatives):

       Parent Descendant
    0       1          2
    1       1          3
    2       1          4
    3       2          1
    4       2          3
    5       2          4
    6       3          1
    7       3          2
    8       3          4
    

    If you want to include the distance of the descendants from their ancestors then you have to switch from a DFS to a BFS (by using a deque with .appendleft for stack):

    from collections import deque
    
    def descendants(childs, source):
        descendants = {}
        stack = deque([(source, 0)])
        while stack:
            descendant, distance = stack.pop()
            distance += 1
            if descendant not in childs:
                continue
            for child in childs[descendant]:
                if child != source and child not in descendants:
                    stack.appendleft((child, distance))
                    descendants[child] = distance
        return descendants
    
    childs = {
        parent: set(childs) for parent, childs in df.groupby("Parent")["Child"]
    }
    
    df_descendants = pd.DataFrame(
        [
            (parent, descendant, distance)
            for parent in df["Parent"].unique()
            for descendant, distance in descendants(childs, parent).items()
        ],
        columns=["Parent", "Descendant", "Distance"]
    ).sort_values(["Parent", "Distance", "Descendant"])
    

    Result for the sample:

       Parent  Descendant  Distance
    0       1           2         1
    1       1           3         1
    2       1           4         2
    3       2           1         1
    4       2           4         1
    5       2           3         2
    6       3           2         1
    7       3           1         2
    8       3           4         2