Applying dtplyr directly to a data.table instead of a lazy_dt


I want to perform several operations intertwining dtplyr and data.table code. My question is whether, having loaded dtplyr, I can apply dplyr verbs to a data.table object and get optimized data.table code as I would with a lazy_dt.

Below are some examples. In each case, does dtplyr translate the pipeline to data.table code, or is plain dplyr doing the work?

# Setup for all chunks:
library(dplyr)
library(data.table)
library(dtplyr)

a) setDT

dataframe # class data.frame
setDT(dataframe)

dataframe %>% 
  group_by(id) %>% 
  mutate(rows_per_group = n())

b) data.table object

dt <- as.data.table(dataframe) # or dt <- data.table::fread(filepath)
dt %>%
  group_by(id) %>% 
  mutate(rows_per_group = n())

Also, if all of them trigger the dtplyr translation, which is the most efficient option: a), b), or c) using lazy_dt(dataframe)?


Solution

  • I was wondering about a similar question, and after reading this post I ran some benchmarks. I varied the following:

    • Function of which package is used: data.table, dplyr or dtplyr
    • Object class: tibble or data.table

    The results are:

    [Figure: boxplots of benchmark time by method (dt, dtplyr, dtplyr_trans, dplyr), filled by object class]

    The results do not confirm the claim that "If you have a data.table, using it with any dplyr generic will automatically convert it to a lazy_dt object": applying dplyr functions directly to a data.table object is much slower than going through dtplyr::lazy_dt(). Also, as you can see, dtplyr::lazy_dt() is faster when it is given a data.table object than when it is given a tibble. However, it makes no sense to convert a tibble to a data.table before calling dtplyr::lazy_dt() on it: the conversion plus lazy_dt() takes about as long as calling dtplyr::lazy_dt() on the tibble directly (compare the dtplyr and dtplyr_trans results, where dtplyr_trans_fun() first calls as.data.table(data) to convert the given object).

    The code I used is

    # Data generated as in linked blog post
    library(data.table)
    library(dplyr)
    library(dtplyr)
    library(microbenchmark)
    library(ggplot2)
    
    N <- 1e7
    K <- 100
    set.seed(1)
    
    dttbl <- data.table(
      id1 = sample(sprintf("id%03d", 1:K), N, TRUE), # large groups (char)
      id5 = sample(N / K, N, TRUE), # small groups (int)
      v1 = sample(5, N, TRUE), # int in range [1,5]
      v2 = sample(5, N, TRUE), # int in range [1,5]
      v3 = sample(round(runif(100, max = 100), 4), N, TRUE) # numeric, e. g. 23.5749
    )
    tbbl <- as_tibble(dttbl)
    
    # data.table method.
    dt_fun <- function(data){
      data[, lapply(.SD, sum), keyby = id5, .SDcols = 3:5]
    }
    
    
    # dtplyr method with lazy_dt.
    dtplyr_fun <- function(data){
      data %>%
        lazy_dt() %>%
        group_by(id5) %>%
        summarise_at(vars(v1:v3), sum) %>%
        as_tibble()
    }
    
    # dtplyr method with lazy_dt where the provided data is transformed
    # to data.table first.
    dtplyr_trans_fun <- function(data){
      data %>%
        as.data.table() %>%
        lazy_dt() %>%
        group_by(id5) %>%
        summarise_at(vars(v1:v3), sum) %>%
        as_tibble()
    }
    
    # dplyr method.
    dplyr_fun <- function(data){
      data %>%
        group_by(id5) %>%
        summarise_at(vars(v1:v3), sum) %>%
        as_tibble()
    }
    
    results <- list(dttbl, tbbl) %>%
      lapply(., function(object_i){
        if(is.data.table(object_i)){
          microbenchmark(
            dt= dt_fun(data= object_i),
            dtplyr= dtplyr_fun(data= object_i),
            dtplyr_trans= dtplyr_trans_fun(data= object_i),
            dplyr= dplyr_fun(data= object_i),
            times= 20) %>%
            {data.frame(method= .$expr, time= .$time, class= class(object_i)[1])} %>%
            mutate(method= gsub("data = object_i", "", method))
        } else{
          microbenchmark(
            dtplyr= dtplyr_fun(data= object_i),
            dtplyr_trans= dtplyr_trans_fun(data= object_i),
            dplyr= dplyr_fun(data= object_i),
            times= 20) %>%
            {data.frame(method= .$expr, time= .$time, class= class(object_i)[1])} %>%
            mutate(method= gsub("data = object_i", "", method))
        }
      }) %>%
      do.call("rbind.data.frame", .)
    
    results %>%
      mutate(method= factor(method, c("dt", "dtplyr", "dtplyr_trans", "dplyr")),
             class= factor(class, unique(class))) %>%
      ggplot(., aes(time, method, fill= class)) +
      geom_boxplot() +
      guides(fill= guide_legend(reverse= TRUE)) +
      theme_bw()
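
    Timings only show indirectly whether a pipeline was translated. A more direct check, sketched here under the assumption that dtplyr marks its lazy pipelines with the class "dtplyr_step", is to inspect the class of the result and, for an explicit lazy_dt, to print the generated data.table code with show_query(). The toy table and the names dt, lazy, direct and was_translated are illustrative and not part of the benchmark above; what the last line returns may depend on the installed dtplyr version.

    ```r
    library(dplyr)
    library(data.table)
    library(dtplyr)

    # Small illustrative table (hypothetical data, not the benchmark data).
    dt <- data.table(id = c(1, 1, 2), x = 1:3)

    # Explicit lazy_dt: the pipeline is recorded, not run, until collected.
    lazy <- dt %>%
      lazy_dt() %>%
      group_by(id) %>%
      mutate(rows_per_group = n())

    # Print the data.table call that dtplyr generated for this pipeline.
    show_query(lazy)

    # Same verbs applied directly to the data.table, as in a) and b).
    direct <- dt %>%
      group_by(id) %>%
      mutate(rows_per_group = n())

    # TRUE would mean the pipeline became a dtplyr step;
    # FALSE would mean plain dplyr handled the data.table as a data frame.
    was_translated <- inherits(direct, "dtplyr_step")
    class(direct)
    was_translated
    ```

    If was_translated is FALSE on your setup, wrapping the object in lazy_dt() explicitly, as in dtplyr_fun() above, is the way to make sure the translation happens.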