Understanding YeoJohnson transformation in R


I tried to apply a Yeo-Johnson transformation using caret and recipes, but I think I am either not specifying the calls properly or missing some extra parameters.

library(tidyverse)
library(tidytuesdayR)

# Data is all numeric except for column 7
# get it from
# https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-08-15/spam.csv
# or load it with tt_load()
spam <- tt_load(2023, week=33)$spam


# pre-process
pp_hpc <- caret::preProcess(spam[,1:6], 
                            method = c("center", "scale", "YeoJohnson"))
# fails to transform all variables
pp_hpc
Created from 4601 samples and 6 variables

Pre-processing:
  - centered (6)
  - ignored (0)
  - scaled (6)
  - Yeo-Johnson transformation (1)

Lambda estimates for Yeo-Johnson transformation:
0
# I can apply the transformation, but it obviously does not transform all the columns as expected
transformed <- predict(pp_hpc, newdata = spam[,1:6])

Trying with recipes now

# recipes package 
library(recipes)
# do I really need this just to transform the data?
rec <- recipe(
  yesno ~ .,
  data = spam
)

yj_transform <- step_YeoJohnson(rec, all_numeric())
# only some variables end up transformed
yj_estimates <- prep(yj_transform, verbose = TRUE)
yj_estimates

── Recipe ────────────────────────────────────────────────

── Inputs 
Number of variables by role
outcome:   1
predictor: 6

── Training information 
Training data contained 4601 data points and no
incomplete rows.

── Operations 
• Yeo-Johnson transformation on: crl.tot, bang | Trained

Same result: applying the recipe works, but not all columns are transformed (I am not centering/scaling here, since that is not the issue).

yj_te <- bake(yj_estimates, spam)

The bestNormalize package seems to have no issue here:

# works as expected
df_transformed <- select(spam, where(is.numeric)) %>% 
  mutate(across(everything(), function(x) predict(bestNormalize::yeojohnson(x), newdata = x)))

For reference, this is how I would do it in Python (here via reticulate):

# Python version
library(reticulate)
repl_python()
from sklearn import preprocessing
X = r.spam.drop('yesno', axis = 1)
scaler = preprocessing.PowerTransformer().set_output(transform="pandas")
X = scaler.fit_transform(X)

Solution

  • The Yeo-Johnson transformation requires a lambda value, which must either be supplied or estimated. step_YeoJohnson() estimates a suitable value for each variable.

    The limits argument sets the range of lambda values to search over. It defaults to c(-5, 5).

    The documentation states:

    If the transformation parameters are estimated to be very closed to the bounds, or if the optimization fails, a value of NA is used and no transformation is applied.

    Based on that, if you widen the bounds used to search for lambda, you will likely see more variables being transformed. Indeed, when I run:

    yj_transform <- step_YeoJohnson(rec, all_numeric(), limits = c(-20, 20))
    # with the wider limits, all numeric variables are transformed
    yj_estimates <- prep(yj_transform, verbose = TRUE)
    yj_estimates
    

    The summary output reports that it ran the transformation on all 6 of the numeric variables, instead of just two, and running:

    > tidy(yj_estimates,number = 1)
    # A tibble: 6 × 3
      terms        value id              
      <chr>        <dbl> <chr>           
    1 crl.tot   0.000979 YeoJohnson_jKN6C
    2 dollar  -13.1      YeoJohnson_jKN6C
    3 bang     -3.88     YeoJohnson_jKN6C
    4 money   -14.6      YeoJohnson_jKN6C
    5 n000    -13.4      YeoJohnson_jKN6C
    6 make    -11.0      YeoJohnson_jKN6C
    

    ...reports that the estimated lambda values are well beyond the (-5,5) range for all but two of the variables.
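To make the role of the search bounds more concrete, here is a minimal base-R sketch of the Yeo-Johnson transformation and a profile-likelihood estimate of lambda. It is an illustration only, not the code used inside recipes or caret: the function names (yj_transform, yj_loglik, estimate_lambda), the tol argument, and the "return NA when the estimate sits on a bound" rule are assumptions modelled on the documented behaviour quoted above, and the data are synthetic rather than the spam data set.

```r
# Yeo-Johnson transform of a numeric vector for a fixed lambda.
yj_transform <- function(y, lambda, eps = 1e-8) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox-like transform of (y + 1).
  if (abs(lambda) < eps) {
    out[pos] <- log1p(y[pos])
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }
  # Negative values: mirrored transform with exponent (2 - lambda).
  if (abs(lambda - 2) < eps) {
    out[!pos] <- -log1p(-y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

# Profile log-likelihood of lambda under a normal model for the transformed data.
yj_loglik <- function(lambda, y) {
  z <- yj_transform(y, lambda)
  n <- length(y)
  sigma2 <- mean((z - mean(z))^2)  # maximum-likelihood variance estimate
  -n / 2 * log(sigma2) + (lambda - 1) * sum(sign(y) * log1p(abs(y)))
}

# Hypothetical helper: search for lambda inside `limits`.
# Assumed rule (not taken from the recipes source): an estimate that lands
# within `tol` of a bound is treated as a failed search and returned as NA.
estimate_lambda <- function(y, limits = c(-5, 5), tol = 0.001) {
  lam <- optimize(yj_loglik, interval = limits, y = y, maximum = TRUE)$maximum
  if (min(abs(lam - limits)) <= tol) NA_real_ else lam
}

# Synthetic, zero-heavy variable (loosely similar in shape to word-frequency columns).
set.seed(1)
x <- c(rep(0, 900), rexp(100))
print(estimate_lambda(x, limits = c(-5, 5)))
print(estimate_lambda(x, limits = c(-20, 20)))
```

Comparing the two printed estimates shows how the chosen limits can change whether a usable lambda is found. With lambda = 1 the transform is the identity, which is a quick sanity check for yj_transform().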