Search code examples
rpurrrlookup-tables

Complete incompletely linked documents (document trees) with a lookup-list using a function


I have linked documents (document trees) saved in a list (list)

Some of the documents trees have items that are not complete (marked with seach=1). Some trees might have multiple incomplete trees that are marked with search=1.

I would like to extend/complete these incomplete trees using lookup list that contain document trees (list_lookup), there is always only one matching tree in list and list_lookup. The level of the matching document tree should be adjusted to the document tree in list.

Sample data and desired output:

library(tidyverse)

# initial df1, aaa is incomplete (it is in fact linked to other documents, but this information is stored in the lookup list)
 
df1 <- tibble(id_from=c(NA_character_,"111","222","333","444","444","bbb"),
             id_to=c("111","222","333","444","aaa","bbb","ccc"),
             level=c(0,1,2,3,4,4,5),
             search=c(0,0,0,0,1,0,0))
df1
#> # A tibble: 7 × 4
#>   id_from id_to level search
#>   <chr>   <chr> <dbl>  <dbl>
#> 1 <NA>    111       0      0
#> 2 111     222       1      0
#> 3 222     333       2      0
#> 4 333     444       3      0
#> 5 444     aaa       4      1
#> 6 444     bbb       4      0
#> 7 bbb     ccc       5      0


# lookup dfs, df2 contains the further document links of aaa
df2 <- tibble(id_from=c(NA,"aaa","x","x"),
             id_to=c("aaa","x","x1","x2"),
             level=c(0,1,2,2))

df3 <- tibble(id_from=c(NA,"thank"),
                     id_to=c("thank","you"),
                     level=c(0,1))

#list with df
list <- list(df1)

#list with lookups
list_lookup <- list(df2,df3)

list_lookup
#> [[1]]
#> # A tibble: 4 × 3
#>   id_from id_to level
#>   <chr>   <chr> <dbl>
#> 1 <NA>    aaa       0
#> 2 aaa     x         1
#> 3 x       x1        2
#> 4 x       x2        2
#> 
#> [[2]]
#> # A tibble: 2 × 3
#>   id_from id_to level
#>   <chr>   <chr> <dbl>
#> 1 <NA>    thank     0
#> 2 thank   you       1

#what I need; an updated list of dfs where information from the lookup list are included

df1_wanted <- tibble(id_from=c(NA_character_,"111","222","333","444","444","aaa","bbb","x","x"),
                     id_to=c("111","222","333","444","aaa","bbb","x","ccc","x1","x1"),
                     level=c(0,1,2,3,4,4,5,5,6,6))

list(df1_wanted)
#> [[1]]
#> # A tibble: 10 × 3
#>    id_from id_to level
#>    <chr>   <chr> <dbl>
#>  1 <NA>    111       0
#>  2 111     222       1
#>  3 222     333       2
#>  4 333     444       3
#>  5 444     aaa       4
#>  6 444     bbb       4
#>  7 aaa     x         5  <- added from df2, level adjusted
#>  8 bbb     ccc       5  
#>  9 x       x1        6  <- added from df2, level adjusted
#> 10 x       x1        6  <- added from df2, level adjusted

Created on 2023-04-01 with reprex v2.0.2

My approach:

I thought about using purrr::map to map a function to every item of list, however, I am not sure how this function should look like.


Solution

  • In this solution:

    1. I first define a recursive function, get_tree(), that takes a single id and lookup table and returns the full tree from the table for that id.
    2. Then, I define a function, complete_tree(), that takes a dataframe and a list of lookup tables, iterates over get_tree() for every id_to where search == 1 and for each lookup table, adjusts level, and binds the results to the initial dataframe.
    3. Finally, I iterate over complete_tree() for every element of list.
    library(dplyr)
    library(purrr)
    
    get_tree <- function(id, lookup) {
      branch <- filter(lookup, id_from == id)
      if (nrow(branch) == 0) return()
      bind_rows(
        branch, 
        map(branch$id_to, \(x) get_tree(x, lookup))
      )
    }
    
    complete_trees <- function(data, lookups) {
      branches <- pmap(
        filter(data, search == 1),
        \(id_to, level, ...) {
          bind_rows(map(
              lookups, 
              \(lookup) get_tree(id_to, lookup)
            )) %>%
            mutate(level = level + .env$level)
        }
      )
      bind_rows(data, branches) %>%
        select(!search) %>%
        arrange(level, id_from)
    }
    
    map(list, \(x) complete_trees(x, lookups = list_lookup))
    

    Result:

    [[1]]
    # A tibble: 10 × 3
       id_from id_to level
       <chr>   <chr> <dbl>
     1 <NA>    111       0
     2 111     222       1
     3 222     333       2
     4 333     444       3
     5 444     aaa       4
     6 444     bbb       4
     7 aaa     x         5
     8 bbb     ccc       5
     9 x       x1        6
    10 x       x2        6