I have tried to perform a Yeo-Johnson transformation using caret and recipes, but I think I am not specifying the calls properly or I am missing some extra parameters.
library(tidyverse)
library(tidytuesdayR)
# Data is all numeric except for column 7
# get it from
# https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15/spam.csv
# or load it with tt_load()
spam <- tt_load(2023, week=33)$spam
# pre-process
pp_hpc <- caret::preProcess(spam[,1:6],
method = c("center", "scale", "YeoJohnson"))
# fails to transform all variables
pp_hpc
Created from 4601 samples and 6 variables
Pre-processing:
- centered (6)
- ignored (0)
- scaled (6)
- Yeo-Johnson transformation (1)
Lambda estimates for Yeo-Johnson transformation:
0
# I can apply the transformation, but it obviously is not applied to all the columns
transformed <- predict(pp_hpc, newdata = spam[,1:6])
Trying with recipes now:
# recipes package
library(recipes)
# do I really need this just to transform the data?
rec <- recipe(
yesno ~ .,
data = spam
)
yj_transform <- step_YeoJohnson(rec, all_numeric())
# only some variables get transformed
yj_estimates <- prep(yj_transform, verbose = TRUE)
yj_estimates
── Recipe ────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 6
── Training information
Training data contained 4601 data points and no
incomplete rows.
── Operations
• Yeo-Johnson transformation on: crl.tot, bang | Trained
Same result: applying it works, but not all columns are transformed (I am also not centering/scaling, since that is not the problem here).
yj_te <- bake(yj_estimates, spam)
The bestNormalize package seems to have no issue here:
# works as expected
df_transformed <- select(spam, where(is.numeric)) %>%
mutate_all(.funs = function(x) predict(bestNormalize::yeojohnson(x), newdata = x))
Just in case, this is how I would do it in Python, or from R using reticulate:
# Python version
library(reticulate)
repl_python()
from sklearn import preprocessing
X = r.spam.drop('yesno', axis = 1)
scaler = preprocessing.PowerTransformer().set_output(transform="pandas")
X = scaler.fit_transform(X)
The Yeo-Johnson transformation requires a lambda value, which either needs to be supplied or estimated. step_YeoJohnson() estimates a suitable value for each variable. The limits argument sets the range of values to search in; it defaults to c(-5, 5).
The documentation states:
If the transformation parameters are estimated to be very closed to the bounds, or if the optimization fails, a value of NA is used and no transformation is applied.
So based on that, if you widen the bounds used to search for suitable values of lambda, you'll likely see more variables being transformed. Indeed, when I run:
# widen the search range for lambda
yj_transform <- step_YeoJohnson(rec, all_numeric(), limits = c(-20, 20))
yj_estimates <- prep(yj_transform, verbose = TRUE)
yj_estimates
The summary output reports that it ran the transformation on all 6 of the numeric variables, instead of just two, and running:
> tidy(yj_estimates,number = 1)
# A tibble: 6 × 3
  terms        value id
  <chr>        <dbl> <chr>
1 crl.tot   0.000979 YeoJohnson_jKN6C
2 dollar  -13.1      YeoJohnson_jKN6C
3 bang     -3.88     YeoJohnson_jKN6C
4 money   -14.6      YeoJohnson_jKN6C
5 n000    -13.4      YeoJohnson_jKN6C
6 make    -11.0      YeoJohnson_jKN6C
...reports that the estimated lambda values are well beyond the (-5, 5) range for all but two of the variables.
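To make the role of the search limits more concrete, here is a minimal, self-contained Python sketch of a bounded lambda search. It is not the code recipes, caret, or scikit-learn actually run: the function names (yeo_johnson, log_likelihood, estimate_lambda), the simple grid search, and the eps tolerance for "too close to the bounds" are all assumptions made for illustration. It uses the standard Yeo-Johnson formula and the usual normal profile log-likelihood, and returns None when the best lambda lands on a bound, mimicking the "no transformation is applied" behaviour described above.

```python
import math


def yeo_johnson(x, lam, tol=1e-12):
    """Yeo-Johnson transform of a single value x for a given lambda."""
    if x >= 0:
        if abs(lam) < tol:
            return math.log1p(x)
        # ((x + 1)^lam - 1) / lam, written with expm1 for accuracy
        return math.expm1(lam * math.log1p(x)) / lam
    if abs(lam - 2.0) < tol:
        return -math.log1p(-x)
    # -((1 - x)^(2 - lam) - 1) / (2 - lam)
    return -math.expm1((2.0 - lam) * math.log1p(-x)) / (2.0 - lam)


def log_likelihood(xs, lam):
    """Normal profile log-likelihood of lambda (up to an additive constant)."""
    n = len(xs)
    ys = [yeo_johnson(x, lam) for x in xs]
    mean = sum(ys) / n
    var = sum((y - mean) * (y - mean) for y in ys) / n
    if var <= 0.0:
        return float("-inf")
    # Jacobian term: sum of sign(x) * log(|x| + 1)
    jacobian = sum(math.copysign(math.log1p(abs(x)), x) for x in xs)
    return -0.5 * n * math.log(var) + (lam - 1.0) * jacobian


def estimate_lambda(xs, limits=(-5.0, 5.0), steps=2001, eps=1e-3):
    """Grid-search lambda within limits; return None if it ends on a bound.

    The grid search and the eps tolerance are illustrative assumptions,
    not the optimiser or threshold used by any particular package.
    """
    lo, hi = limits
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    best = max(grid, key=lambda lam: log_likelihood(xs, lam))
    if best - lo <= eps or hi - best <= eps:
        return None  # estimate sits on a bound: skip the transformation
    return best


# Hypothetical zero-inflated column: 90 zeros and 10 small positive values
skewed = [0.0] * 90 + [0.5] * 10
print(estimate_lambda(skewed))                        # default limits
print(estimate_lambda(skewed, limits=(-40.0, 40.0)))  # widened limits
```

On this toy column the default search stops at the lower bound and returns None, while the widened search settles at roughly -25. That is the same pattern as in the tidy() output: heavily zero-inflated, right-skewed variables can have a best-fitting lambda far below -5, so they are only transformed once the limits are widened.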