Tags: r, data.table, mapply, furrr

How to do faster list-column operations inside data.table


Due to memory (and speed) issues, I was hoping to do some computations inside a data.table instead of doing them outside it.

The following code uses 1 million rows, but I'm working with 40 million rows.

library(tictoc)
library(data.table) # version 1.11.8
library(purrr)
library(furrr)
plan(multiprocess)

veryfing_function <- function(vec1, vec2){
  vector <- as.vector(outer(vec1, vec2, paste0))
  split(vector, ceiling(seq_along(vector)/length(vec1)))
}
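
For example, on one pair of length-3 vectors the function returns a named list of the three column-wise paste0 combinations:

veryfing_function(c("a", "b", "c"), c("f", "g", "h"))
# $`1`
# [1] "af" "bf" "cf"
#
# $`2`
# [1] "ag" "bg" "cg"
#
# $`3`
# [1] "ah" "bh" "ch"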


dt <- data.table(letters = replicate(1e6, sample(letters[1:5], 3, TRUE), simplify = FALSE),
                 numbers = replicate(1e6, sample(letters[6:10], 3, TRUE), simplify = FALSE))



tic()
result1 <- future_map2(dt$letters, dt$numbers, veryfing_function)
toc()


tic()
result2 <- mapply(veryfing_function, dt$letters, dt$numbers, SIMPLIFY = FALSE)
toc()



tic()
dt[, result := future_map2(letters, numbers, veryfing_function)]
toc()


tic()
dt[, result2 := mapply(veryfing_function, letters, numbers, SIMPLIFY = FALSE)]
toc()

The output is the same for all variants and as expected. The benchmarks were:

  • future_map2, outside data.table:  26 secs
  • mapply, outside data.table:       72 secs
  • future_map2, inside data.table:   38 secs
  • mapply, inside data.table:       105 secs

so I saw no advantage in calling the functions inside the data.table or in using mapply.

My major concern is memory, which is not resolved with the future_map2 solution.

I'm using Windows right now, so mclapply is not an option; I was hoping to find some other way to gain speed, maybe a data.table trick I'm not seeing (keying is not supported for list columns).


Solution

  • This is really a question about memory and data storage types. All of my discussion will be for 100,000 data elements so that everything doesn't bog down.

    Let's examine a vector of length 100,000 vs. a list containing 100,000 separate elements.

    object.size(rep(1L, 1E5))
    #400048 bytes
    object.size(replicate(1E5, 1, simplify = F))
    #6400048 bytes
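
    The extra 6 MB is almost entirely per-object overhead: on a typical 64-bit build of R, every vector carries a ~48-byte header, so each separate length-1 double costs 56 bytes, and the list itself stores an 8-byte pointer per element. A quick sanity check (exact sizes may vary by platform):

    object.size(1)
    # 56 bytes
    object.size(vector("list", 1e5))
    # 800048 bytes
    # 1e5 * 56 + 800048 = 6400048 bytes -- the list size above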
    

    We went from 0.4 MB to 6.4 MB just by storing the data differently! Applying your function with Map(veryfing_function, ...) to only 1E5 elements:

    dt <- data.table(letters = replicate(1e5, sample(letters[1:5], 3, TRUE), simplify = FALSE),
                     numbers = replicate(1e5, sample(letters[6:10], 3, TRUE), simplify = FALSE))
    
    tic()
    result2 <- Map(veryfing_function, dt[['letters']], dt[['numbers']])
    toc()
    # 11.93 sec elapsed
    object.size(result2)
    # 109,769,872 bytes
    #example return:
    [[1000]]
    [[1000]]$`1`
    [1] "cg" "bg" "cg"
    
    [[1000]]$`2`
    [1] "ch" "bh" "ch"
    
    [[1000]]$`3`
    [1] "ch" "bh" "ch"
    

    We can make a simple modification to your function to return unnamed lists instead of splitting, which saves a little memory: split() returns named lists, and we don't really need the names:

    verifying_function2 <- function(vec1, vec2) {
      vector <- outer(vec1, vec2, paste0) #not as.vector
      lapply(seq_len(ncol(vector)), function(i) vector[, i]) #no need to split, just return a list
    }
    
    tic()
    result2_mod <- Map(verifying_function2, dt[['letters']], dt[['numbers']])
    toc()
    # 2.86 sec elapsed
    object.size(result2_mod)
    # 73,769,872 bytes
    
    #example_output
    [[1000]]
    [[1000]][[1]]
    [1] "cg" "bg" "cg"
    
    [[1000]][[2]]
    [1] "ch" "bh" "ch"
    
    [[1000]][[3]]
    [1] "ch" "bh" "ch"
    

    The next step is to ask why we should return a list of lists at all. I am using lapply() in the modified function just to match your output. Dropping the lapply() would instead return a list of matrices, which I think would be just as useful:

    tic()
    result2_mod2 <- Map(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
    toc()
    # 1.66 sec elapsed
    object.size(result2_mod2)
    # 68,570,336 bytes
    
    #example output:
    [[1000]]
         [,1] [,2] [,3]
    [1,] "cg" "ch" "ch"
    [2,] "bg" "bh" "bh"
    [3,] "cg" "ch" "ch"
    

    The last logical step is to just return a matrix. Note that this whole time we have been fighting against simplification with mapply(..., SIMPLIFY = FALSE), which is equivalent to Map().
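
    A quick check of that equivalence (Map() is defined as mapply() with SIMPLIFY = FALSE):

    identical(Map(paste0, 1:3, 4:6),
              mapply(paste0, 1:3, 4:6, SIMPLIFY = FALSE))
    # [1] TRUE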

    tic()
    result2_mod3 <- mapply(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
    toc()
    # 1.3 sec elapsed
    object.size(result2_mod3)
    # 7,201,616 bytes
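
    Because SIMPLIFY = TRUE column-binds the length-9 results, result2_mod3 is now a single 9 x 100,000 character matrix instead of 100,000 small objects; column i is the column-major flattening of the outer() matrix for row i:

    dim(result2_mod3)
    # [1]      9 100000
    result2_mod3[, 1000]
    # [1] "cg" "bg" "cg" "ch" "bh" "ch" "ch" "bh" "ch"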
    

    If you want some dimensionality, you can convert the large matrix into a 3D array:

    tic()
    result2_mod3_arr <- array(as.vector(result2_mod3), dim = c(3,3,1E5))
    toc()
    # 0.02 sec elapsed
    result2_mod3_arr[,,1000]
         [,1] [,2] [,3]
    [1,] "cg" "ch" "ch"
    [2,] "bg" "bh" "bh"
    [3,] "cg" "ch" "ch"
    object.size(result2_mod3_arr)
    # 7,201,624 bytes
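
    The conversion is essentially free: the array is the same character vector with a dim attribute attached, just 8 bytes more than the matrix:

    as.numeric(object.size(result2_mod3_arr)) - as.numeric(object.size(result2_mod3))
    # [1] 8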
    

    I also looked at @marbel's answer - it is faster and takes up only slightly more memory. My approach would likely benefit from converting the initial dt list columns to something else sooner.

    tic()
    dt1 = as.data.table(do.call(rbind, dt[['letters']]))
    dt2 = as.data.table(do.call(rbind, dt[['numbers']]))
    
    res = data.table()
    
    combs = expand.grid(names(dt1), names(dt2), stringsAsFactors=FALSE)
    
    for (i in seq_len(nrow(combs))) {
      # set() adds one pasted column per letter/number column combination
      set(res, j = paste0(combs[i, 1], combs[i, 2]),
          value = paste0(dt1[[combs[i, 1]]], dt2[[combs[i, 2]]]))
    }
    toc()
    # 0.14 sec elapsed
    object.size(res)
    # 7,215,384 bytes
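
    As a quick cross-check (using the same sampled data as the earlier outputs), row i of res holds the same nine pastes as column i of result2_mod3, just spread across named columns:

    unlist(res[1000])
    # V1V1 V2V1 V3V1 V1V2 V2V2 V3V2 V1V3 V2V3 V3V3
    # "cg" "bg" "cg" "ch" "bh" "ch" "ch" "bh" "ch"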
    

    tl;dr - convert your object to a matrix or data.frame to go easier on your memory. It also makes sense that the data.table versions of your function take longer - there's likely more overhead than just applying mapply() directly.