I am preprocessing a dataset with the R {recipes} package: a Yeo-Johnson transformation to make the variables more normally distributed, followed by normalization to standardize them. Afterwards I want to reduce the size of the prepped recipe object, so I use the butcher package, but it does not help. I also tried manually removing the 'template' element, where the data is stored, but again the size stays the same. Any idea how to reduce the size for storage and later use? Here is a realistic example of the problem I am facing:
library(tidyverse)
library(recipes)
# Let's generate skewed numeric data of size 20,000 x 3,000 (my real data has 10x more rows)
n <- 3000
example_list <-
  1:n %>%
  map(~abs(rnorm(n = 20000, mean = 0, sd = sample(seq(0.1, 10, length.out = n), size = n))))
names(example_list) <- paste0("col_", 1:n)
example_tibble <- as_tibble(example_list)
# Let's create the preprocessing recipe
new_recipe <-
  recipe( ~ ., data = example_tibble) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(strings_as_factors = FALSE, retain = FALSE)
# Let's check the structure and size of the recipe object (sizes in MB)
butcher::weigh(new_recipe)
#> # A tibble: 9,034 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.terms 480.
#> 3 steps.lambdas 0.232
#> 4 steps.means 0.232
#> 5 steps.sds 0.232
#> 6 var_info.variable 0.208
#> 7 term_info.variable 0.208
#> 8 last_term_info.variable 0.208
#> 9 template.col_1 0.160
#> 10 template.col_2 0.160
#> # … with 9,024 more rows
lobstr::obj_size(new_recipe)
#> 481,649,536 B
# Let's try to remove unnecessary parts of the object
new_recipe_butchered <- butcher::butcher(new_recipe, verbose = TRUE)
#> ✖ No memory released. Do not butcher.
# Let's check the size again
lobstr::obj_size(new_recipe_butchered)
#> 481,650,016 B
butcher::weigh(new_recipe_butchered)
#> # A tibble: 9,034 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.lambdas 0.232
#> 3 steps.means 0.232
#> 4 steps.sds 0.232
#> 5 var_info.variable 0.208
#> 6 term_info.variable 0.208
#> 7 last_term_info.variable 0.208
#> 8 template.col_1 0.160
#> 9 template.col_2 0.160
#> 10 template.col_3 0.160
#> # … with 9,024 more rows
# Let's try to remove the template with the data
new_recipe_butchered$template <- NULL
butcher::weigh(new_recipe_butchered)
#> # A tibble: 6,034 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.lambdas 0.232
#> 3 steps.means 0.232
#> 4 steps.sds 0.232
#> 5 var_info.variable 0.208
#> 6 term_info.variable 0.208
#> 7 last_term_info.variable 0.208
#> 8 var_info.role 0.0241
#> 9 var_info.source 0.0241
#> 10 term_info.role 0.0241
#> # … with 6,024 more rows
# Let's check the size again - still the same
lobstr::obj_size(new_recipe_butchered)
#> 481,650,016 B
Created on 2021-06-17 by the reprex package (v0.3.0)
It seems I am not able to reduce the size. Can someone help?
This issue has been resolved in the development version of {butcher}, which you can install with
# install.packages("devtools")
devtools::install_github("tidymodels/butcher")
{butcher} will now remove the environment from the terms of each step.
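To illustrate why an environment attached to the terms can dominate the object size, here is a minimal base-R sketch. It uses a plain formula as a stand-in for the quosures that {recipes} stores in the terms of a step (an assumption made for illustration; make_formula and big are hypothetical names), and it measures size as the length of the serialized object, which is roughly what ends up on disk.

```r
# A formula remembers the environment it was created in.
# Here that environment also holds a large vector.
make_formula <- function() {
  big <- rnorm(1e6)  # about 8 MB of doubles, local to this call
  ~ x                # returned formula keeps a reference to this environment
}

f <- make_formula()

# Serializing the formula drags the whole captured environment along.
size_with_env <- length(serialize(f, NULL))

# Pointing the formula at the empty environment drops the reference to `big`.
environment(f) <- emptyenv()
size_without_env <- length(serialize(f, NULL))

cat("with environment:   ", size_with_env, "bytes\n")
cat("without environment:", size_without_env, "bytes\n")
```

In the question's output each steps.terms entry weighs about 480 MB, which equals 20,000 x 3,000 doubles at 8 bytes each, so it is consistent with the full data staying reachable through such an environment.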
library(tidyverse)
library(recipes)
n <- 3000
example_list <-
  1:n %>%
  map(~abs(rnorm(n = 20000, mean = 0, sd = sample(seq(0.1, 10, length.out = n), size = n))))
names(example_list) <- paste0("col_", 1:n)
example_tibble <- as_tibble(example_list)
new_recipe <-
  recipe( ~ ., data = example_tibble) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(strings_as_factors = FALSE, retain = FALSE)
butcher::weigh(new_recipe)
#> # A tibble: 12,033 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.terms 480.
#> 2 steps.terms 480.
#> 3 steps.lambdas 0.232
#> 4 steps.means 0.232
#> 5 steps.sds 0.232
#> 6 var_info.variable 0.208
#> 7 term_info.variable 0.208
#> 8 last_term_info.variable 0.208
#> 9 var_info.role 0.0241
#> 10 var_info.source 0.0241
#> # … with 12,023 more rows
lobstr::obj_size(new_recipe)
#> 481,985,880 B
new_recipe_butchered <- butcher::butcher(new_recipe, verbose = TRUE)
#> ✓ Memory released: '480,170,888 B'
lobstr::obj_size(new_recipe_butchered)
#> 1,814,992 B
butcher::weigh(new_recipe_butchered)
#> # A tibble: 12,033 x 2
#> object size
#> <chr> <dbl>
#> 1 steps.lambdas 0.232
#> 2 steps.means 0.232
#> 3 steps.sds 0.232
#> 4 var_info.variable 0.208
#> 5 term_info.variable 0.208
#> 6 last_term_info.variable 0.208
#> 7 var_info.role 0.0241
#> 8 var_info.source 0.0241
#> 9 term_info.role 0.0241
#> 10 term_info.source 0.0241
#> # … with 12,023 more rows
Created on 2021-06-17 by the reprex package (v2.0.0)
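Since the question also asks about storage and later use, here is a minimal sketch of that round trip on a small data set. It assumes a {butcher} version that includes the fix above; small_data and the temporary file path are hypothetical and only there to keep the example quick to run.

```r
library(recipes)
library(butcher)

set.seed(1)
# Small skewed stand-in for the real data (hypothetical)
small_data <- data.frame(
  a = abs(rnorm(100)),
  b = abs(rnorm(100, sd = 5))
)

rec <- recipe(~ ., data = small_data) %>%
  step_YeoJohnson(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep(retain = FALSE)

# Strip the parts that are not needed for applying the recipe later
rec_small <- butcher(rec)

# Store the butchered recipe on disk
path <- tempfile(fileext = ".rds")
saveRDS(rec_small, path)

# Later, possibly in a fresh session: load it and apply it to new data
rec_loaded <- readRDS(path)
baked <- bake(rec_loaded, new_data = small_data)
head(baked)
```

Comparing file.size(path) for the butchered and the original recipe is a simple way to check what is actually saved on disk.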