Tags: r, dplyr, purrr, readr

spec_tbl_df is over 10 times slower than a normal tibble on the same operations


I was tearing my hair out trying to work out why two R sessions with the same data were taking wildly different times to complete the same task. After a lot of restarting R, clearing out all my variables, and running from a genuinely clean session, I found the culprit: the new data structure returned by vroom and readr (spec_tbl_df) is, for some reason, very sluggish in my script. The easy fix is to convert the data to a plain tibble as soon as it is loaded in (sketched after the reprex below). But is there some other explanation, such as poor coding practice in my functions, that accounts for the sluggish behaviour? Or is this a bug in recent updates of these packages? If it is, and someone more experienced with reporting bugs to the tidyverse wants to pick it up, here is a reprex showing the behaviour, because this feels out of my depth.

#Load packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(vroom)
library(tidyr)
library(microbenchmark)
#Generate some dummy data
ex_data <- tibble(
  sd = 1,
  mean = 1:1000,
  a1 = rnorm(1000, mean, sd),
  a2 = rnorm(1000, mean, sd),
  a3 = rnorm(1000, mean, sd)
  ) %>% 
  mutate(
    a1 = if_else(a1<mean, NA_real_, a1),
    a2 = if_else(a2<mean, NA_real_, a2),
    a3 = if_else(a3<mean, NA_real_, a3)
  )
#Row-wise imputation helper that exposes the behaviour
impute_row <- function(mean, sd, data){
  if(!anyNA(data)){
    return(data)
  }else{
    data <- as.data.frame(data)
    data[is.na(data)] <-  rnorm(n = sum(is.na(data)), mean = mean, sd = sd)
    return(data)
  }
}
#Main function
imputer <- function(data){
  data %>% 
    mutate(
      data = pmap(list(mean, sd, data), impute_row)
    ) %>% 
    unnest(cols = data)
}
#Generate dummy file
out_file <- tempfile(fileext = ".csv")
vroom_write(ex_data, out_file, ",")
#Read it in
ex_data_spc <- vroom(out_file, col_types = cols()) %>% 
  nest(data = -c(mean, sd))
#Nest the original data as well
ex_data <- ex_data %>% 
  nest(data = -c(mean, sd))
#Benchmark
microbenchmark(
  tib = imputer(ex_data),
  spc_tib = imputer(ex_data_spc),
  times = 10
)
#> Unit: milliseconds
#>     expr        min         lq       mean     median        uq       max neval
#>      tib   82.81192   87.45288   89.19118   90.47263   91.2216   93.4418    10
#>  spc_tib 1041.90378 1070.00579 1244.97090 1076.92022 1093.0054 2780.0722    10

Created on 2021-06-14 by the reprex package (v2.0.0)

In the worst case this is almost 30 times slower than running on a plain tibble.
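
For reference, the workaround I have in mind is simply coercing to a plain tibble straight after reading, something like this (a sketch only, not part of the timings above; as_tibble() should drop the extra spec_tbl_df class):

# Not benchmarked above -- read with vroom, then immediately coerce to a
# plain tibble before nesting, so the spec_tbl_df class is dropped.
ex_data_plain <- vroom(out_file, col_types = cols()) %>% 
  as_tibble() %>% 
  nest(data = -c(mean, sd))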


Solution

  • This is the issue I had in mind. These problems are known to come from vroom itself, rather than from the spec_tbl_df class, which does not really do much on its own.

    vroom does all sorts of things to speed reading up; AFAIK mostly through lazy (ALTREP) reading. That is also why all.equal() reports so many differing components when comparing the two datasets.

    With vroom:

    ~~~(snip)~~~
    ex_data_spc <- vroom(out_file, col_types = cols()) %>% 
      nest(data = -c(mean, sd))
    ~~~(snip)~~~
    
    #> Unit: milliseconds
    #>     expr       min        lq     mean    median        uq       max neval cld
    #>  spc_tib 1679.2088 1704.3085 2106.864 1731.6694 1942.9444 4918.4498    10   b
    #>      tib  149.8716  158.8548  169.489  170.3735  182.5681  192.8533    10  a
    
    all.equal(ex_data, ex_data_spc)
    #>    [1] "Component \"data\": Component 1: Attributes: < Names: 1 string mismatch >"                                                 
    #>    [2] "Component \"data\": Component 1: Attributes: < Length mismatch: comparison on first 2 components >"                        
    #>    [3] "Component \"data\": Component 1: Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"   
    #>    [4] "Component \"data\": Component 1: Attributes: < Component \"class\": 3 string mismatches >"                                 
    #>    [5] "Component \"data\": Component 1: Attributes: < Component 2: Modes: numeric, externalptr >"  
    ~~~(snip)~~~
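
    If the lazy (ALTREP) reading is indeed what costs you here, you should also be able to turn it off at read time instead of converting afterwards. A sketch I have not benchmarked, assuming a recent vroom where the argument is called altrep (older releases used altrep_opts):

    # Sketch, not from the benchmarks above: altrep = FALSE asks vroom to
    # materialise the columns eagerly instead of reading them lazily.
    ex_data_eager <- vroom(out_file, col_types = cols(), altrep = FALSE) %>% 
      nest(data = -c(mean, sd))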
    

    With readr:

    ~~~(snip)~~~
    ex_data_spc <- readr::read_csv(out_file, col_types = cols()) %>% 
      nest(data = -c(mean, sd))
    ~~~(snip)~~~
    #> Unit: milliseconds
    #>     expr      min       lq     mean   median       uq      max neval cld
    #>  spc_tib 148.9432 161.7315 181.2137 184.4592 191.9048 219.7883    10   a
    #>      tib 161.9441 166.7826 175.3644 175.3354 181.4598 197.5544    10   a
    
    all.equal(ex_data, ex_data_spc)
    #> [1] TRUE
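
    Side note, not part of the benchmarks above: readr 2.x reads through vroom and exposes the same behaviour via read_csv()'s lazy argument, so eager reading can be requested explicitly there as well. A sketch:

    # Sketch assuming readr >= 2.0, where read_csv() gained a lazy argument;
    # lazy = FALSE forces the file to be read eagerly.
    ex_data_spc <- readr::read_csv(out_file, col_types = cols(), lazy = FALSE) %>% 
      nest(data = -c(mean, sd))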
    

    You could post your reprex to that issue if you feel like it.