Search code examples
rdplyrpurrrlubridate

Using purrr to help transform a large data file


I have a bit of code that goes through a number of columns containing dates and selects the earliest date from the options to populate a new column with. To do this I was using the dplyr::rowwise function.

Unfortunately, the data set is quite big and comes at a time cost in obtaining an output. Here is an example of my initial approach.

library(tidyverse)
library(lubridate)

set.seed(101)

data <- tibble(date1 = sample(
  seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
  100, replace = TRUE),
  date2 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
    100, replace = TRUE),
  date3 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
    100, replace = TRUE),
  date4 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
    100, replace = TRUE),
  date5 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
    100, replace = TRUE))

So for the first attempt I opted for rowwise. I hadn't used this before, but the output is identified as 'rowwise_df', which I take to be similar if I had used group_by.

data <- data %>%
  rowwise() %>%
  mutate(earlierst_date = min(c(date1, date2, date3, date4, date5),
                              na.rm = TRUE))

Having looked around, it would appear that rowwise is not considered the best approach (see excellent back and forth here). Reading through, I attempted the following...

data <- data %>%
  mutate(try_again = pmap(list(date1, date2, date3, date4, date5), 
                          min, na.rm = TRUE)) %>%
  mutate(try_again = as_date(try_again))

table(data$earlierst_date == data$try_again)
#> 
#> TRUE 
#>  100

According to my reprex run the second option is twice as fast.

start.time <- Sys.time()
data <- data %>%
  rowwise() %>%
  mutate(earlierst_date = min(c(date1, date2, date3, date4, date5),
                              na.rm = TRUE))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
#> Time difference of 0.07597804 secs

start.time <- Sys.time()
data <- data %>%
  mutate(try_again = pmap(list(date1, date2, date3, date4, date5), 
                          min, na.rm = TRUE)) %>%
  mutate(try_again = as_date(try_again))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
#> Time difference of 0.03266287 secs

My questions:

1. Is the second strategy using pmap fit for purpose or is there some inherent error present that I can't see? For example, in earlier attempts, the output column contained list values rather than vectors which threw me.

I get dizzy anytime I have to work with dates, especially when I read comments such as "A date is a day stored as the number of days since 1970-01-01"...

2. Do the code run times make sense?

Any improvements/direction greatly received.


Solution

  • I agree with @det that rowwise isn't the way to go. I think perhaps the pmin function might be the best suited to the task, e.g.

    data <- transform(data, earliest_date = pmin(date1, date2, date3, date4, date5, na.rm = TRUE))
    

    Benchmarking (updated to include a data.table solution):

    library(tidyverse)
    library(lubridate)
    
    set.seed(101)
    
    data <- tibble(date1 = sample(
      seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
      100, replace = TRUE),
      date2 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
                     100, replace = TRUE),
      date3 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
                     100, replace = TRUE),
      date4 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
                     100, replace = TRUE),
      date5 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'), 
                     100, replace = TRUE))
    
    rowwise_func <- function(data){
      data %>%
        rowwise() %>%
        mutate(earliest_date = min(c(date1, date2, date3, date4, date5),
                                   na.rm = TRUE)) %>% 
        ungroup()
    }
    
    pmap_func <- function(data){
      data %>% 
        mutate(try_again = pmap(list(date1, date2, date3, date4, date5), 
                              min, na.rm = TRUE))
      }
    
    det_func1 <- function(data){
      data %>%
      mutate(min_date = pmap_dbl(select(., matches("^date")), min) %>% as.Date(origin = "1970-01-01"))
    }
    
    det_faster <- function(data){
      data[["min_date"]] <- data %>% 
        mutate(across(where(is.Date), as.integer)) %>% 
        as.matrix() %>% 
        apply(1, function(x) x[which.min(x)]) %>%
        as.Date(origin = "1970-01-01")
    }
    
    transform_func <- function(data){
      as_tibble(transform(data, earliest_date = pmin(date1, date2, date3, date4, date5, na.rm = TRUE)))
    }
    
    dt_func <- function(data){
      setDT(data)
      data[, earliest_date := pmin(date1, date2, date3, date4, date5, na.rm = TRUE)]
    }
    
    times <- microbenchmark::microbenchmark(rowwise_func(data), pmap_func(data), det_func1(data), det_faster(data), transform_func(data), dt_func(data))
    autoplot(times)
    
    data2 <- transform_func(data)
    data3 <- rowwise_func(data)
    identical(data2, data3)
    #> TRUE
    

    example_3.png

    Unit: microseconds
                     expr      min        lq      mean    median        uq        max neval cld
       rowwise_func(data) 6764.693 6919.6720 7375.0418 7066.6220 7271.5850  16290.696   100  ab
          pmap_func(data) 3994.973 4150.1360 9425.3880 4252.9850 4437.2950 491030.248   100   b
          det_func1(data) 5576.240 5724.6820 6249.7573 5845.3305 5985.5940  15106.741   100  ab
         det_faster(data) 3182.016 3305.3525 3556.8628 3362.8720 3444.0505  12771.952   100  ab
     transform_func(data)  564.194  624.1055  697.5630  680.1130  718.7975   1513.184   100  a 
            dt_func(data)  650.611  723.7235  956.7916  759.3355  782.0565  10806.902   100  a 
    

    So, based on the functions I used above, the transform + pmin method was ~ 10X faster than the rowwise method.