Search code examples
python-3.xpandas

Dataframe with FROM and TO column. Unordered, Get correct combinations


df = pd.DataFrame({
    'from': ['Start', '21', '73', 'Start', '55', '1', '2', '3'],
    'to': ['21', '73', '55', '1', '54', '2', '3', '4']
})
from to
0 Start 21
1 21 73
2 73 55
3 Start 1
4 55 54
5 1 2
6 2 3
7 3 4

desired output

[['Start', '21', '73', '55', '54'], ['Start', '1', '2', '3', '4']]

[EDIT]

So you begin with "Start" in the "from" column and find "21" in the "to" column of that row. You have to find "21" in the "from" column now and move on to "73". Then repeat the steps, until you can't find anything in "from", which indicates the connection is done.

The connections are acyclic and unbranched.

[END EDIT]

the rows don't have to be in the correct order, as for 55, 54.


Solution

  • This is a graph problem, you could use networkx to solve it.

    First de-duplicate the Start, then create a directed graph, get the weakly_connected_components, then the dag_longest_path:

    import networkx as nx
    
    # deduplicate the Start
    m = df['from'].eq('Start')
    from_ = df['from'].mask(m, 'Start'+m.cumsum().astype(str))
    # ['Start1', '21', '73', 'Start2', '55', '1', '2', '3']
    
    # create a graph
    G = nx.from_edgelist(zip(from_, df['to']),
                         create_using=nx.DiGraph)
    
    # get isolated paths
    out = [nx.dag_longest_path(G.subgraph(c))
           for c in nx.weakly_connected_components(G)]
    

    Output:

    [['Start1', '21', '73', '55', '54'], ['Start2', '1', '2', '3', '4']]
    

    Graph:

    pandas edge pairs in order networkx graph

    You could also use a python function to manually follow the path for each Start:

    def find_path(df, frm='from', to='to'):
        # create a Series with from as index, to as values
        s = df.set_index(frm)[to]
        out = []
        # for each Start, follow the path
        for val in s[s.index=='Start']:
            out.append(['Start', val])
            while (val := s.get(val)) is not None:
                out[-1].append(val)
        return out
    
    out = find_path(df)
    # [['Start', '21', '73', '55', '54'], ['Start', '1', '2', '3', '4']]
    

    NB. this is assuming the graphs are non cyclic, non branched. If this is the case you can detect the cycles by keeping a set of seen nodes. The branches can be handled as well, but more details on the expected logic are needed.