Search code examples
rrecursiontreetidyverse

Tidyverse way of repeatedly following parent IDs until ancestor


The themes dataset from Rebrickable includes for each theme its ID and its parent's ID (columns have been renamed here), which may recurse (ID may have grandparents, great-grandparents, etc.) Here is an example which follows the parent chain City -> Advent -> Seasonal:

themes <- data.frame(theme_id = c(206, 207, 208), 
                     theme_name = c("Seasonal", "Advent", "City"), 
                     parent_theme_id = c(NA, 206, 207))

Here is my base R code to follow the parent IDs up until reaching NA, meaning no parent:

for (i in 1:nrow(themes)) {
  orig_id <- themes[i,]$theme_id
  cur_id <- orig_id
  repeat {
    par_id <- themes[themes$theme_id == cur_id,]$parent_theme_id
    if (is.na(par_id)) break
    cur_id <- par_id
  }
 
  themes[themes$theme_id == orig_id,]$ancestor_theme_id = cur_id
}

Is there a nicer tidyverse or base R way of creating a new column that consists of the greatest ancestor IDs? I can't think of a nice way to recursively follow ID "pointers" without operating one row at a time and having a loop somewhere.


Solution

  • As @r2evans explains in the comments, there aren't tidyverse functions for the recursive lookup itself, but you can make your code a bit more succinct and tidyverse-y by defining a recursive function and calling it from mutate(map_dbl()):

    library(dplyr)
    library(purrr)
    
    find_ancestor <- function(id) {
      dat <- cur_data()
      parent <- filter(dat, theme_id == id)$parent_theme_id
      if (is.na(parent)) id else find_ancestor(parent)
    }
    
    themes %>% 
      mutate(ancestor_theme_id = map_dbl(theme_id, find_ancestor))
    
      theme_id theme_name parent_theme_id ancestor_theme_id
    1      206   Seasonal              NA               206
    2      207     Advent             206               206
    3      208       City             207               206