Tags: r, performance, igraph, subgraph

Sampling subgraphs of different sizes using igraph


I have an igraph object mygraph with ~10,000 nodes and ~145,000 edges, and I need to create a number of subgraphs from this graph, but of different sizes. Specifically, I need to create subgraphs of a given size (from 5 nodes to 500 nodes) in which all the nodes are connected. I need to create ~1,000 subgraphs for each size (i.e., 1,000 subgraphs of size 5, 1,000 of size 6, and so on) and then calculate some values for each subgraph according to different node attributes. I have some code, but it takes a long time to do all the calculations. I thought about using the graphlets function to get the different sizes, but every time I run it on my computer it crashes due to memory issues.

Here is the code I am using:

The first step was to create a function that generates the subgraphs of different sizes and does the calculations needed:

random_network <- function(size, G) {
  # Score a subgraph based on its node attributes
  score_fun <- function(g) {
    sum(V(g)$weight * V(g)$RWRNodeweight) / sqrt(sum(V(g)$RWRNodeweight^2))
  }

  genes.idx <- V(G)$name
  perm <- c()
  while (length(perm) < 1000) {
    # Grow a connected subgraph from a random seed node by repeatedly
    # adding a random unused neighbor of the nodes chosen so far
    seed <- sample(genes.idx, 1)
    while (length(seed) < size) {
      tmp.neigh <- V(G)[unlist(neighborhood(G, 1, seed))]$name
      tmp.neigh <- setdiff(tmp.neigh, seed)
      if (length(tmp.neigh) > 0) {
        seed <- c(seed, sample(tmp.neigh, 1))
      } else {
        break  # dead end: the component is too small, so start over
      }
    }
    # Keep the score only if the subgraph reached the requested size
    if (length(seed) == size) {
      perm <- c(perm, score_fun(induced.subgraph(G, seed)))
    }
  }
  perm
}

The second step was to apply the function to the actual graph:

 ### generate some example data
 library(igraph)
 my_graph <- erdos.renyi.game(10000, 0.0003)
 V(my_graph)$name <- 1:vcount(my_graph)
 V(my_graph)$weight <- rnorm(10000)
 V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)

 ### Run the code to get the subgraphs of each size and do the node-based calculations
 genesets.length <- 5:500   # note: seq(5:500) would give 1:496, not the sizes 5..500
 genesets.length.null.dis <- list()
 for (k in genesets.length) {
     genesets.length.null.dis[[as.character(k)]] <- random_network(size=k, G=my_graph)
 }
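
For reference, a single size can be timed with base R's system.time before attempting the full 5:500 sweep (a minimal sketch; size 50 is an arbitrary choice):

 system.time(random_network(size=50, G=my_graph))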

Solution

  • One approach to speed up your code beyond what's possible in base R would be to use the Rcpp package. Consider the following Rcpp implementation of the full operation, which takes the following inputs:

    • valid: The indices of all nodes that are in a large-enough component
    • el, deg, firstPos: A representation of the graph's edge list (node i's neighbors are el[firstPos[i]], el[firstPos[i]+1], ..., el[firstPos[i]+deg[i]-1]; a small worked example of this layout follows the code below)
    • size: The subgraph size to sample
    • nrep: The number of repetitions
    • weights: The node weights stored in V(G)$weight
    • RWRNodeweight: The node weights stored in V(G)$RWRNodeweight
    library(Rcpp)
    cppFunction(
    "NumericVector scores(IntegerVector valid, IntegerVector el, IntegerVector deg,
                          IntegerVector firstPos, const int size, const int nrep,
                          NumericVector weights, NumericVector RWRNodeweight) {
      const int n = deg.size();
      std::vector<bool> used(n, false);
      std::vector<bool> neigh(n, false);
      std::vector<int> neighList;
      std::vector<double> scores(nrep);
      for (int outerIter=0; outerIter < nrep; ++outerIter) {
        // Initialize variables
        std::fill(used.begin(), used.end(), false);
        std::fill(neigh.begin(), neigh.end(), false);
        neighList.clear();
    
        // Random first node
        int recent = valid[rand() % valid.size()];
        used[recent] = true;
        double wrSum = weights[recent] * RWRNodeweight[recent];
        double rrSum = RWRNodeweight[recent] * RWRNodeweight[recent];
    
        // Each additional node
        for (int idx=1; idx < size; ++idx) {
          // Add neighbors of recent
          for (int p=firstPos[recent]; p < firstPos[recent] + deg[recent]; ++p) {
            if (!neigh[el[p]] && !used[el[p]]) {
              neigh[el[p]] = true;
              neighList.push_back(el[p]);
            }
          }
    
          // Compute new node to add from all neighbors
          int newPos = rand() % neighList.size();
          recent = neighList[newPos];
          used[recent] = true;
          wrSum += weights[recent] * RWRNodeweight[recent];
          rrSum += RWRNodeweight[recent] * RWRNodeweight[recent];
    
          // Remove from neighList
          neighList[newPos] = neighList[neighList.size() - 1];
          neighList.pop_back();
        }
    
        // Compute score from wrSum and rrSum
        scores[outerIter] = wrSum / sqrt(rrSum);
      }
      return NumericVector(scores.begin(), scores.end());
    }
    ")
    

    Now all that we need to do in base R is generate the arguments for scores, which can be done pretty easily:

    josilber.rcpp <- function(size, num.rep, G) {
    
      # Determine which nodes fall in sufficiently large connected components
      comp <- components(G)
      valid <- which(comp$csize[comp$membership] >= size)
    
      # Construct an edge list representation for use in the Rcpp code
      el <- get.edgelist(G, names=FALSE) - 1
      el <- rbind(el, el[,2:1])
      el <- el[order(el[,1]),]
      deg <- degree(G)
      first.pos <- c(0, cumsum(head(deg, -1)))
    
      # Run the proper number of replications
      scores(valid-1, el[,2], deg, first.pos, size, num.rep,
             as.numeric(V(G)$weight), as.numeric(V(G)$RWRNodeweight))
    }
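
    For example, a single call might look like this (hypothetical usage; my_graph is the example graph generated in the question):

    res <- josilber.rcpp(size=50, num.rep=1000, G=my_graph)
    summary(res)   # 1000 scores for connected subgraphs of size 50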
    

    The time to perform 1000 replications is blazing fast compared to the original code and all the igraph solutions we've seen so far; a sketch for reproducing this kind of timing appears after the list. (Note that for much of this benchmarking I tested the original josilber and random_network functions with 1 replication instead of 1000, because testing with 1000 would have taken prohibitively long.)

    • Size=10: 0.06 seconds (a 1200x speedup over my previously proposed josilber function and a 4000x speedup over the original random_network function)
    • Size=100: 0.08 seconds (an 8700x speedup over my previously proposed josilber function and a 162000x speedup over the original random_network function)
    • Size=1000: 0.13 seconds (a 32000x speedup over my previously proposed josilber function and a 20.4-million-fold speedup over the original random_network function)
    • Size=5000: 0.32 seconds (a 68000x speedup over my previously proposed josilber function and a 290-million-fold speedup over the original random_network function)
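
    A sketch for reproducing this kind of timing, assuming the microbenchmark package is installed (the size and number of repetitions here are arbitrary choices):

    library(microbenchmark)
    microbenchmark(josilber.rcpp(size=100, num.rep=1000, G=my_graph), times=10)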

    In short, Rcpp probably makes it feasible to compute 1000 replicates for each size from 5 to 500 (the whole operation should take about 1 minute), while computing those replicates with the pure R code proposed so far would be prohibitively slow.
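
    Putting it together, the full sweep from the question could be expressed as follows (a sketch; it assumes josilber.rcpp and my_graph as defined above):

    genesets.length.null.dis <- lapply(5:500, function(k)
        josilber.rcpp(size=k, num.rep=1000, G=my_graph))
    names(genesets.length.null.dis) <- 5:500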