Search code examples
pythonalgorithmnetworkx

How to get connection from every leaf to every node in all paths from leaf to root in Networkx


So I have been working on this problem for a little while now. I asked this question before, but in a context of only using python. I got great answer from trincot in this thread(also the problem is described in detail here):
Find all paths from leafs to each node in a forest
The problem that I run into when using it, is that, It's too slow. It works very well for files with about 300 rows, but mine has 11k. When I tried to run it, it took 95% of my 16 GB RAM and didn't return the result even after 3 days of compiling. So I thought about using NetworkX for this because I read that It's the best lib for graphs. The data will create multiple trees which may not be connected with each other and also there are no multigraphs. Right now my code only takes the input from a file where data is stored like this:

Parent Child
Parent Child
Parent Child
...

So let's take example data from my previous thread:

Parent             child
ANALYTICAL_BALANCE BFG_DEPOSIT 
CUSTOMER_DETAIL BALANCE 
BFG_2056 FFD_15 
BALANCE BFG_16 
BFG_16 STAT_HIST 
ANALYTICAL_BALANCE BFG_2056 
CUSTOM_DATA AND_11 
AND_11 DICT_DEAL 
DICT_DEAL BFG_2056 

Visual Example

I load the data into pandas dataframe and delete the connections where Parent and Child are the same:

data = pd.read_csv('data.txt', sep=" ", header=0)
data = data[data['child'] != data['Parent]]

then I create a DiGraph from my dataframe:

G = nx.from_pandas_edgelist(data, source = 'Parent', target = 'child', create_using = nx.DiGraph())

And I try to sift for a roots and leaves:

roots = (v for v, d in G.in_degree() if d==0)
leaves = (v for v, d in G.out_degree() if d ==0)

aaaaaand I'm stuck. I have two ideas. Take every possible path from leaves to roots and then only print leaf -> node connection starting from leaf to root or take every node, check if it exists in a path from leaves to roots, take It's leaf and print leaf -> node. The output for the example data will be like this:

ANALYTICAL_BALANCE BFG_DEPOSIT

ANALYTICAL_BALANCE BFG_2056
ANALYTICAL_BALANCE FFD_15
CUSTOM_DATA AND_11
CUSTOM_DATA DICT_DEAL
CUSTOM_DATA BFG_2056
CUSTOM_DATA FFD_15

CUSTOMER_DETAIL BALANCE
CUSTOMER_DETAIL BFG_16
CUSTOMER_DETAIL STAT_HIST

If you maybe have any other ideas on how I can solve this or what library may be better for this feel free to write. Any help would be greatly appreciated!


Solution

  • Ok so I guess that the answer to my problem was really simple. The whole code looks like this:

    data = pd.read_csv('data.txt', sep=" ", header=0)
    data = data[data['child'] != data['Parent']]
    G = nx.from_pandas_edgelist(data, source = 'Parent', target = 'child', create_using = nx.DiGraph())
    leaves = list((v for v, d in G.in_degree() if d==0))
    nodes = G.nodes()
    for leaf in leaves:
        for node in nodes:
            if nx.has_path(G,leaf, node) and node not in leaves:
                print(leaf, node)
    

    Idea behind this code is really simple. I just take every leaf from a forest and check if there exists a path from leaf to every node. Ofcourse the leafs are also nodes so when I see one I pass it. It may be a little naive, but works great.