Search code examples
pythonpandasrecursiontreedescendant

Find all descendants for points in Python


I need to get all descendants point of links represented with side_a - side_b (in one dataframe) until reach for each side_a their end_point (in other dataframe). So:

df1:
side_a   side_b
  a        b
  b        c
  c        d
  k        l
  l        m
  l        n
  p        q
  q        r
  r        s

df2:
side_a    end_point
  a          c
  b          c
  c          c
  k          m
  k          n
  l          m
  l          n
  p          s
  q          s
  r          s

The point is to get all points for each side_a value until reach end_point from df2 for that value. If it has two end_point values (like "k" does) that it should be two lists.

I have some code but it's not written with this approach, it drops all rows from df1 if df1['side_a'] == df2['end_points'] and that causes certain problems. But if someone wants me to post the code I will, of course.

The desired output would be something like this:

side_a    end_point
  a          [b, c]
  b          [c]
  c          [c]
  k          [l, m]
  k          [l, n]
  l          [m]
  l          [n]
  p          [q, r, s]
  q          [r, s]
  r          [s]

And one more thing, if there is the same both side, that point doesn't need to be listed at all, I can append it later, whatever it's easier.

import pandas as pd
import numpy as np
import itertools

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))

    # to flatten the list 
    list_of_children =  [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']

    new_df = new_df.append(temp_df)

So, the problem with this code is that works if I drop rows where side_a is equal to end_point from df2. I don't know how to implement condition that if catch the df2 in side_b column, then stop, don't go further.

Any help or hint is welcomed here, truly. Thanks in advance.


Solution

  • You can use networkx library and graphs:

    import networkx as nx
    G = nx.from_pandas_edgelist(df, source='side_a',target='side_b')
    df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
                         nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)
    

    Output:

      side_a  end_point
    0      a     [b, c]
    1      b        [c]
    2      c         []
    3      k     [l, m]
    4      k     [l, n]
    5      l        [m]
    6      l        [n]
    7      p  [q, r, s]
    8      q     [r, s]
    9      r        [s]