Search code examples
rperformancetreeigraph

Identify all children of a certain node for very large data


I am trying to find all downstream nodes in a large graph object (~1 million edges).

In this example, I want to see all nodes that are downstream from 4:

enter image description here

The below code is adapted from this answer using igraph but it ran for over 20 minutes before I ended it. I saw the hR package (see here) but it also has trouble with the size of the data.

I'm new to graph analyses and not sure if there are better methods

library(tidyverse)
library(igraph)

graph_df <- # real example has 1.2M connections
  make_tree(100000, children = 3)


leafnodes <- # bottleneck
  sapply(
    X = V(graph_df), 
    FUN = \(x) length(neighbors(graph_df, x)) == 0
  )


find_all_children <- function(root) {
  paths <- 
    get.all.shortest.paths(
      graph = graph_df, 
      from = V(graph_df)[root], 
      to = leafnodes
    )$res
  
  tibble(
    path = map_chr(paths, ~paste(.x, collapse = " -> "))
  )
}

# desired output
find_all_children(12)
#> 12 -> 35 -> 104 -> 311 -> 932 -> 2795 -> 8384
#> 12 -> 35 -> 104 -> 311 -> 932 -> 2795 -> 8385
#> 12 -> 35 -> 104 -> 311 -> 932 -> 2795 -> 8386

find_all_children(1234)
#> 1234 -> 3701 -> 11102 -> 33305 -> 99914
#> 1234 -> 3701 -> 11102 -> 33305 -> 99915
#> 1234 -> 3701 -> 11102 -> 33305 -> 99916

Bonus question:

Is there a purrr method to replace the sapply() here? I didn't have any luck with map()


Solution

  • Probably you can try subgraph + subcomponent to prune your tree first and then find all_simple_paths, e.g.,

    n <- 100000
    
    g <- make_tree(n, children = 3) %>%
        set_vertex_attr(name = "name", value = seq(vcount(.)))
    
    find_all_children <- function(g, root) {
        g %>%
            subgraph(subcomponent(., root, "out")) %>%
            all_simple_paths(1, V(.)[degree(., mode = "out") == 0])
    }
    

    and run the following lines for example

    r12 <- find_all_children(g, 12)
    r1234 <- find_all_children(g, 1234)
    

    which might be done in seconds even with n <- 100000.

    Toy Example

    n <- 20
    
    g <- make_tree(n, children = 2) %>%
        set_vertex_attr(name = "name", value = seq(vcount(.)))
    
    plot(g)
    

    enter image description here

    and we can obtain

    > find_all_children(g, 3)
    [[1]]
    + 3/7 vertices, named, from dab4822:
    [1] 3  6  12
    
    [[2]]
    + 3/7 vertices, named, from dab4822:
    [1] 3  6  13
    
    [[3]]
    + 3/7 vertices, named, from dab4822:
    [1] 3  7  14
    
    [[4]]
    + 3/7 vertices, named, from dab4822:
    [1] 3  7  15