Tags: r, list, dplyr, data.table, mapply

Efficiently merging named lists


I have two extremely large datasets and I am looking for a way to merge them efficiently (an inner join is fine). The solution I came up with is to create named split lists on the merging keys and merge those.

The solution is unfortunately still very inefficient. Is there a way I can avoid using dplyr entirely? I believe that is the root of the issue, along with the slow lapply. Is map a good solution?

Here is a reproducible example. Thanks in advance!

library(tidyverse)
library(data.table)

named_group_split <- function(.tbl, ...) {
  
  # get names
  grouped <- group_by(.tbl, ...)
  names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = "-")))
  
  # split data
  grouped %>% 
    group_split(.keep = FALSE) %>% 
    rlang::set_names(names)
}
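
For illustration, here is a minimal sketch of what the helper produces on toy data (the toy table and its column names are my own, not part of the question):

# splitting on one key returns a list of tibbles named after the
# group keys; .keep = FALSE drops the grouping column itself
toy <- data.frame(g = c("a", "a", "b"), x = 1:3)
named_group_split(toy, g)
# $a contains x = 1, 2; $b contains x = 3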


# FIRST SPLIT LIST
set.seed(1)
db_1 <- data.frame(id_1=sample(1:10, 10, replace=T),
                   id_2=sample(LETTERS, 10, replace=T),
                   value1=runif(10, 1.0, 10.0)) %>% 
  data.table() %>% 
  dplyr::mutate(id_1name=id_1,
                id_2name=id_2) %>% 
  named_group_split(id_1name,id_2name)

# SECOND SPLIT LIST
set.seed(2)
db_2 <- data.frame(id_1=sample(1:10, 1000, replace=T),
                   id_2=sample(LETTERS, 1000, replace=T),
                   value2=runif(1000, 1.0, 10.0)) %>% 
  data.table() %>% 
  dplyr::mutate(id_1name=id_1,
                id_2name=id_2) %>% 
  named_group_split(id_1name,id_2name)


keys <- intersect(names(db_1), names(db_2))

result <- setNames(lapply(keys, function(key){
  db_1[[key]] %>% 
    dplyr::left_join(db_2[[key]])
}), keys) %>% 
  do.call(plyr::rbind.fill, .)
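
As an aside (my suggestion, not part of the original attempt), data.table::rbindlist() with fill = TRUE is usually a much faster drop-in for the do.call(plyr::rbind.fill, .) step when stacking the joined pieces:

result <- data.table::rbindlist(
  lapply(keys, function(key) {
    dplyr::left_join(db_1[[key]], db_2[[key]])
  }),
  fill = TRUE  # pad columns missing from some pieces with NA
)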

What I managed to do is the following:

library(purrr)

my_lists <- function(list1, list2) {
  keys <- intersect(names(list1), names(list2))
  # c() concatenates the column lists of each pair of data frames
  result <- map2(list1[keys], list2[keys], c)
  return(result)
}

result <- my_lists(db_1, db_2)

But from there I cannot recreate the data frame I need...
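
For what it's worth, c() does not join rows here, it only concatenates the two data frames' column lists into a plain list. A hedged sketch of a purrr-based version that actually joins each pair and stacks the results (merge_lists is a hypothetical helper name of my own):

library(purrr)
library(dplyr)

merge_lists <- function(list1, list2) {
  keys <- intersect(names(list1), names(list2))
  # join each pair of same-named elements, then stack them,
  # keeping the list names in a "key" column
  map2(list1[keys], list2[keys], inner_join) %>%
    bind_rows(.id = "key")
}

result <- merge_lists(db_1, db_2)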


Solution

  • Perhaps I am missing something, but I think you are looking for a right join.

    library(data.table)
    set.seed(1)
    x <- data.table(id_1=sample(1:10, 10, replace=T),
                    id_2=sample(LETTERS, 10, replace=T))
    set.seed(2)
    y <- data.table(id_1=sample(1:10, 1000, replace=T),
                    id_2=sample(LETTERS, 1000, replace=T),
                    value=runif(1000, 1.0, 10.0))
    
    res <- y[x, on = c("id_1", "id_2")]
    setorderv(res, c("id_1", "id_2"))
    res
    
        id_1 id_2    value
     1:    1    N 8.441071
     2:    1    N 5.680715
     3:    1    U 3.681615
     4:    1    U 7.159788
     5:    1    U 2.732143
     6:    1    U 1.246033
     7:    2    I 5.810975
     8:    2    I 7.015206
     9:    2    J 4.238503
    10:    2    J 6.705899
    11:    2    J 7.852617
    12:    3    O 8.645228
    13:    3    O 7.496095
    14:    3    O 2.553506
    15:    4    J 2.158799
    16:    4    J 4.876194
    17:    4    J 4.118119
    18:    4    J 8.434567
    19:    4    J 9.263452
    20:    4    J 3.661710
    21:    5    E 1.112813
    22:    5    E 4.654503
    23:    7    G 1.307912
    24:    7    G 6.233104
    25:    7    G 3.577720
    26:    7    G 1.918811
    27:    7    G 7.521629
    28:    7    V 2.511941
    29:    7    V 2.785865
    30:    7    V 6.875612
    31:    7    V 4.133124
    32:    7    V 8.655604
    33:    9    U 3.090105
    34:    9    U 5.347673
    

    This is about 250 times faster on my PC, but the effect might differ for your actual data sets.
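
    Since the question notes that an inner join is fine, the same call with nomatch = NULL keeps only the rows matched in both tables instead of padding misses with NA:

    # inner join: drop rows of x with no match in y
    res_inner <- y[x, on = c("id_1", "id_2"), nomatch = NULL]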

    "Is there a way I can avoid using dplyr entirely? I believe that is the root of the issue, along with the slow lapply. Is map a good solution?"

    Generally, when dplyr is too slow it is worth having a look at data.table, or at dtplyr, which translates dplyr syntax to data.table. lapply can be somewhat slow on large datasets if not handled with care, but when variables are properly initialized, the looping construct of choice should not be the main bottleneck of the task. If it is, try vapply.
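
    As a minimal, hedged sketch of the dtplyr route (lazy_dt() is dtplyr's wrapper; x and y are the tables from above):

    library(dtplyr)
    library(dplyr)

    # verbs on a lazy_dt are translated to data.table code and only
    # evaluated when the result is collected with as.data.table()
    res <- lazy_dt(y) %>%
      inner_join(lazy_dt(x), by = c("id_1", "id_2")) %>%
      as.data.table()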