Search code examples
data-miningnetworkxjungspark-graphxlarge-data

Simple path queries on large graphs


I have a question about large graph data. Suppose that we have a large graph with nearly 100 million edges and around 5 million nodes, in this case what is the best graph mining platform that you know of that can give all simple paths of lengths <=k (for k=3,4,5) between any two given nodes. The main concern is the speed of getting those paths. Another thing is that the graph is directed, but we would like the program to ignore the directions when computing the paths but still return the actually directed edges once it spots those paths.

For instance:

a -> c <- d -> b is a valid path between nodes 'a' and 'b' of length 3.

Thanks in advance.


Solution

  • So this is a way to do it in networkx. It's roughly based on the solution I gave here. I'm assuming that a->b and a<-b are two distinct paths you want. I'm going to return this as a list of lists. Each sublist is the (ordered) edges of a path.

    import networkx as nx
    import itertools
    
    def getPaths(G,source,target, maxLength, excludeSet=None):
        #print source, target, maxLength, excludeSet
        if excludeSet== None:
            excludeSet = set([source])
        else:
            excludeSet.add(source)# won't allow a path starting at source to go through source again.
        if maxLength == 0:
            excludeSet.remove(source)
            return []
        else:
            if G.has_edge(source,target):
                paths=[[(source,target)]]
            else:
                paths = []
            if G.has_edge(target,source):
                paths.append([(target,source)])
            #neighbors_iter is a big iterator that will give (neighbor,edge) for each successor of source and then for each predecessor of source.
    
            neighbors_iter = itertools.chain(((neighbor,(source,neighbor)) for neighbor in G.successors_iter(source) if neighbor != target),((neighbor,(neighbor,source)) for neighbor in G.predecessors_iter(source) if neighbor != target))
    
            #note that if a neighbor is both a predecessor and a successor, it shows up twice in this iteration.  
    
            paths.extend( [[edge] + path for (neighbor,edge) in neighbors_iter if neighbor not in excludeSet for path in getPaths(G,neighbor,target,maxLength-1,excludeSet)] )
    
            excludeSet.remove(source) #when we move back up the recursion, don't want to exclude this source any more
    
            return paths
    
    G=nx.DiGraph()
    G.add_edges_from([(1,2),(2,3),(1,3),(1,4),(3,4),(4,3)])
    
    print getPaths(G,1,3,2)
    
    >[[(1, 3)], [(1, 2), (2, 3)], [(1, 4), (4, 3)], [(1, 4), (3, 4)]]
    

    I would expect that by modifying the dijkstra algorithm in networkx you'll arrive at a more efficient algorithm (note that the dijkstra algorithm has a cutoff, but by default it's only going to return the shortest path, and it's going to follow edge direction).

    Here's an alternative version of the whole paths.extend thing: paths.extend( [[edge] + path for (neighbor,edge) in neighbors_iter if neighbor not in excludeSet for path in getPaths(G,neighbor,target,maxLength-1,excludeSet) if len(path)>0 ] )