I was tearing my hair out trying to work out why two different R sessions with the same data took wildly different times to complete the same task. After a lot of restarting R, clearing out all my variables, and running from a genuinely clean session, I found the culprit: the new data structure returned by vroom
and readr
is, for some reason, very sluggish in my script. The easiest fix is to convert the data to a plain tibble as soon as it is loaded. But is there another explanation, such as poor coding practice in my functions, that could account for the sluggish behaviour? Or is this a bug in recent versions of these packages? If so, and if someone has more experience reporting bugs to the tidyverse, here is a reprex
showing the behaviour, because I feel this is out of my league.
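For completeness, the workaround I mean is simply this (a sketch; `"my_file.csv"` is a placeholder, and whether `as_tibble()` also forces vroom's lazy columns, rather than just dropping the extra class, is exactly what I am unsure about):

```r
library(vroom)
library(tibble)

# Convert to a plain tibble immediately after reading
my_data <- as_tibble(vroom("my_file.csv", col_types = cols()))
```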
# Load packages
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(purrr)
library(vroom)
library(tidyr)
library(microbenchmark)
# Generate some dummy data
ex_data <- tibble(
  sd = 1,
  mean = 1:1000,
  a1 = rnorm(1000, mean, sd),
  a2 = rnorm(1000, mean, sd),
  a3 = rnorm(1000, mean, sd)
) %>%
  mutate(
    a1 = if_else(a1 < mean, NA_real_, a1),
    a2 = if_else(a2 < mean, NA_real_, a2),
    a3 = if_else(a3 < mean, NA_real_, a3)
  )
# Wrapper function that exposes the behaviour
impute_row <- function(mean, sd, data){
  if(!anyNA(data)){
    return(data)
  }else{
    data <- as.data.frame(data)
    data[is.na(data)] <- rnorm(n = sum(is.na(data)), mean = mean, sd = sd)
    return(data)
  }
}
# Main function
imputer <- function(data){
  data %>%
    mutate(
      data = pmap(list(mean, sd, data), impute_row)
    ) %>%
    unnest(cols = data)
}
# Generate dummy file
out_file <- tempfile(fileext = ".csv")
vroom_write(ex_data, out_file, ",")
# Read it in
ex_data_spc <- vroom(out_file, col_types = cols()) %>%
  nest(data = -c(mean, sd))
# Nest the original data as well
ex_data <- ex_data %>%
  nest(data = -c(mean, sd))
# Benchmark
microbenchmark(
  tib = imputer(ex_data),
  spc_tib = imputer(ex_data_spc),
  times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> tib 82.81192 87.45288 89.19118 90.47263 91.2216 93.4418 10
#> spc_tib 1041.90378 1070.00579 1244.97090 1076.92022 1093.0054 2780.0722 10
Created on 2021-06-14 by the reprex package (v2.0.0)
In the worst case that is almost 30 times slower than running on a plain tibble.
This is the issue I had in mind. These slowdowns are known to come from vroom itself, rather than from the spec_tbl_df
class, which on its own does not really do much.
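You can see that spec_tbl_df is just one thin extra class by inspecting a freshly read object (a sketch reusing the reprex's `out_file`; the exact class vector may vary by vroom version):

```r
library(vroom)

x <- vroom(out_file, col_types = cols())
class(x)
# typically "spec_tbl_df" "tbl_df" "tbl" "data.frame" -
# one class more than a plain tibble's "tbl_df" "tbl" "data.frame"
```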
vroom
does all sorts of things to speed reading up, AFAIK mostly by reading lazily, so columns are only fully materialised when they are first accessed. That is why you see all those differing components when comparing the two datasets.
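If lazy reading is the cause, one way to test that hypothesis is vroom's `altrep` argument, which disables the lazy ALTREP columns so everything is materialised up front (a sketch reusing the reprex's `out_file`; in vroom versions before 1.1.0 the argument was called `altrep_opts`):

```r
library(vroom)

# Read eagerly: no lazy ALTREP columns, at the cost of a slower initial read
ex_data_eager <- vroom(out_file, col_types = cols(), altrep = FALSE)
```

If the benchmark gap disappears with `altrep = FALSE`, that points at the lazy reading rather than at your functions.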
With vroom:
~~~(snip)~~~
ex_data_spc <- vroom(out_file, col_types = cols()) %>%
  nest(data = -c(mean, sd))
~~~(snip)~~~
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> spc_tib 1679.2088 1704.3085 2106.864 1731.6694 1942.9444 4918.4498 10 b
#> tib 149.8716 158.8548 169.489 170.3735 182.5681 192.8533 10 a
all.equal(ex_data, ex_data_spc)
#> [1] "Component \"data\": Component 1: Attributes: < Names: 1 string mismatch >"
#> [2] "Component \"data\": Component 1: Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Component \"data\": Component 1: Attributes: < Component \"class\": Lengths (3, 4) differ (string compare on first 3) >"
#> [4] "Component \"data\": Component 1: Attributes: < Component \"class\": 3 string mismatches >"
#> [5] "Component \"data\": Component 1: Attributes: < Component 2: Modes: numeric, externalptr >"
~~~(snip)~~~
With readr:
~~~(snip)~~~
ex_data_spc <- readr::read_csv(out_file, col_types = cols()) %>%
  nest(data = -c(mean, sd))
~~~(snip)~~~
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> spc_tib 148.9432 161.7315 181.2137 184.4592 191.9048 219.7883 10 a
#> tib 161.9441 166.7826 175.3644 175.3354 181.4598 197.5544 10 a
all.equal(ex_data, ex_data_spc)
#> [1] TRUE
You could post your reprex to that issue if you feel like it.