python graph scipy sparse-matrix breadth-first-search

APGL (Another Python Graph Library) - Set subgraph size after a Breadth First Search

I've been using the apgl 0.8.1. library to analyse a massive network (40 million of nodes). I tried apgl because it works fine with scipy sparse matrices, and now I can load a sparse matrix into memory to perform analyses. I'm interested in obtaining a subgraph with a desidered size of the all network, after a Breadth First Search Analysis.

I'm reading an adjancency list from pandas to build a sparse matrix. Consider this sample network (Node,Node,Weight) called network:

This is the sample code I'm using:

# Import Modules
import pandas as pd
import numpy as np
import scipy as sp
from apgl.graph import SparseGraph

# Load Network
df = pd.read_csv(network,sep='\s+',header=None,names=['User1','User2','W'])

# Convert the numpy array from pandas into a NxN square matrix
# Read Numpy array from Pandas
arr = df.values

# Set matrix shapes
shapes = tuple(arr.max(axis=0)[:2]+1)

# Build a Sparse Matrix
matrix = sp.sparse.csr_matrix((arr[:, 2], (arr[:, 0], arr[:, 1])), 
                        shape=shape,
                        dtype=arr.dtype)

# Set the total number of nodes
numVertices = shape[0]

# Inizialize Graph
graph = SparseGraph(numVertices, undirected=False, W=matrix, frmt='csr')

# Perform BFS starting from one one --> set output to np.array
startingnode = 5
bfs = np.array(graph.breadthFirstSearch(startingnode))

# Return SubGraph with a list of nodes
# Set limit
limit = 5
subgraph = graph.subgraph(bfs[:limit])

which returns:

bfs = [ 5  1  2  6 14 11  3  7 12  4  8 13]
subgraph = SparseGraph: vertices 5, edges 6, directed, vertex storage GeneralVertexList, edge storage <class 'scipy.sparse.csr.csr_matrix'>

So I set up a limit of 5 nodes of the resulting subgraph. But the nodes are chosen from the first to the fifth, no matter the shape of the search. I mean, the Breadth First Search algorithm is looking for neighbours of neighbours and so on, and I would like to set a limit which includes the search in the all final neighbours level of the last node chosen. In the example, the subgraph contains the first five nodes of the BFS array, so:

subgraph = [5 1 2 6 14]

but I would like to include also nodes 7 (which complete the neighbours level starting from node 5) and 3 (which complete the search in the level of node 2). So the resulting array of nodes should be:

subgraph = [5 1 2 3 6 7 14]

Any help would be appreciated.

EDIT: What I would like to do is to find a subgraph of the entire graph starting from a random node, and performing a BFS algorithm till different size of the subgraph, e.g. 5 million, 10 million and 20 million nodes. I would like to complete each level of neighbours of nodes before stopping, so it doesn't matter if the number of nodes is 5 million or 5 million and 100, if the last 100 nodes are needed to complete a level of all neighbours of the last node found in the search.

Solution

BFS and DFS operate over undirected graphs which could be acyclic (i.e. trees) or not.

During their operation, both algorithms may encounter back-edges. That is, edges that if traversed would bring the algorithm to consider nodes already visited.

These edges and the nodes at their ends are not (normally) reported in the output of BFS or DFS.

Networkx (https://networkx.github.io/) contains dfs_labeled_edges which returns an iterator with a characteristic for each edge.

As far as your code is concerned:

limit = 5
subgraph = graph.subgraph(bfs[:limit])

This is not going to search for all BFS subgraphs of 'length' 5. Because of bfs[:5] this will always be looking at the first 5 entries of the BFS' output.

If you are looking for cycles, perhaps you can use a different algorithm (For example, this is again from Networkx), or extract your subgraph from the original network and then use DFS to label its edges, or enumerate all of its simplest paths and try to work on them.

Hope this helps.

Supplementary Information:

(I am also taking this older question of yours into account here: https://stackoverflow.com/questions/29169022/scipy-depth-breadth-first-search-tree-size-limit)

You want to extract a proper subgraph U(W,C) from your main graph G(V,E) with the order of U being much smaller than the order of G (|U|<<|G|) and furthermore with U being the BFS of G starting on some node v_i of G(V,E).

There are two ways you can do this:

Write your own BFS where you can add counters for the depth of traversal and nodes traversed and use them to interrupt the algorithm wherever you like. Due to the extremely large number of nodes that you have, you should look into the iterative rather than the recursive version of the algorithm. For more information, please see: Way to go from recursion to iteration and also to some extent http://en.wikipedia.org/wiki/Depth-limited_search . This approach will be more efficient because the BFS would not have to go through the whole graph.
Truncate the output of an existing BFS and use the remainder nodes as your starting points for the next step.

In either case, your algorithm will contain one more iteration step and will end up looking something like this (here for option #2):

#Given graph G(V,E) and NNodeLimit (Natural number)
#Produce a set Q of BFS proper subgraphs bearing the characteristics of U.
Q = []
nextBFSNode = [0]

while len(nextBFSNode):
    #Pop the node
    startingPoint = nextBFSNode.pop()
    #Build a BFS graph starting from some node
    q = BFS(G, startingPoint)
    #Truncate its output and save it to the list.
    Q.append(subgraph(G,q[:NNodeLimit]))
    #Add the remaining nodes as future starting points
    nextBFSNode = set(q[NNodeLimit+1:])

There will of course be considerable overlap between different trees.

Hope this helps.