
R: one-hot encoding for one variable in a dataset


I have a dataset in which I would like to one-hot encode one variable and then build a linear model (lm).

This variable is called 'zone'.

What I tried to do is:

lm_model <- train(formula(paste0("price ~", paste0(features, collapse = " + "))),
                  data = predict(dummyVars( ~ "zone", data = data_train), newdata =  data_train), 
                  method = "lm", 
                  trControl = trainControl(method = "cv", number = 10),
                  preProcess = c("center", "scale"),
                  na.action=na.exclude
)

I am not sure about the following part. Could someone please guide me here:

data = predict(dummyVars( ~ "zone", data = data_train), newdata =  data_train), 

Solution

  • Let's use an example with cyl as a categorical variable from mtcars:

    library(caret)
    da <- mtcars
    da$cyl <- factor(da$cyl)
    # cyl is included among the features
    features <- c("cyl","hp","drat","wt","qsec")
    # the dependent variable is mpg
    

    We check what dummyVars does:

    head(predict(dummyVars(mpg ~ ., data = da[, c("mpg", features)]), da))
                      cyl.4 cyl.6 cyl.8  hp drat    wt  qsec
    Mazda RX4             0     1     0 110 3.90 2.620 16.46
    Mazda RX4 Wag         0     1     0 110 3.90 2.875 17.02
    Datsun 710            1     0     0  93 3.85 2.320 18.61
    Hornet 4 Drive        0     1     0 110 3.08 3.215 19.44
    Hornet Sportabout     0     0     1 175 3.15 3.440 17.02
    Valiant               0     1     0 105 2.76 3.460 20.22
    

    You can see that it introduces 3 binary variables for cyl and keeps the continuous variables. The dependent variable (mpg) is not part of the predict(...) output.
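
    If only one variable should be encoded (as with 'zone' in the question), a minimal sketch is to give dummyVars a one-sided formula with the unquoted variable name and then bind the result back to the remaining columns. Quoting the name inside the formula, as in ~ "zone", most likely does not refer to the column. The names dv_cyl, cyl_onehot and da_encoded below are illustrative only:

    ```r
    library(caret)

    da <- mtcars
    da$cyl <- factor(da$cyl)

    # encode only cyl: the variable name goes into the formula unquoted
    dv_cyl <- dummyVars(~ cyl, data = da)
    cyl_onehot <- predict(dv_cyl, newdata = da)

    # put the indicator columns next to the columns that were left untouched
    da_encoded <- cbind(da[, setdiff(names(da), "cyl")], cyl_onehot)
    head(da_encoded)
    ```

    For your data, the same pattern should apply with zone in place of cyl and data_train in place of da, assuming zone is a factor or character column.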

    So for the training:

    onehot_data <- cbind(mpg=da$mpg,
    predict(dummyVars(mpg~.,data=da[,c("mpg",features)]),da))
    
    lm_model <- train(mpg ~.,data=onehot_data,  
                      method = "lm", 
                      trControl = trainControl(method = "cv", number = 10),
                      preProcess = c("center", "scale"),
                      na.action=na.exclude
    )
    

    This throws a warning:

    Warning messages:
    1: In predict.lm(modelFit, newdata) :
      prediction from a rank-deficient fit may be misleading
    

    For linear models, caret fits a model with an intercept. The one-hot columns of a categorical variable sum to 1 in every row, so the intercept column is a linear combination of the one-hot encoded columns and the fit is rank-deficient.
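
    The linear dependence can be checked directly. This is a small base R sketch (it uses model.matrix instead of caret, so the column names are cyl4, cyl6, cyl8 rather than cyl.4, cyl.6, cyl.8); the variable names are illustrative:

    ```r
    da <- mtcars
    da$cyl <- factor(da$cyl)

    # "- 1" drops the intercept so that all three indicator columns are kept
    onehot_cyl <- model.matrix(~ cyl - 1, data = da)

    # each row has exactly one 1 among the indicator columns
    all_rows_sum_to_one <- all(rowSums(onehot_cyl) == 1)

    # design matrix with an explicit intercept column plus all three indicators
    X <- cbind(intercept = 1, onehot_cyl)
    design_rank <- qr(X)$rank  # number of linearly independent columns
    n_columns <- ncol(X)

    c(rank = design_rank, columns = n_columns)
    ```

    The rank is smaller than the number of columns, which is what the warning is about.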

    You need to decide which level of the categorical variable will be the reference level, and remove that column from the one-hot data frame, for example:

    # remove cyl.4 (column 2, right after mpg), so 4 cylinders becomes the reference level
    onehot_data = onehot_data[,-2]
    lm_model <- train(mpg ~.,data=onehot_data,  
                      method = "lm", 
                      trControl = trainControl(method = "cv", number = 10),
                      preProcess = c("center", "scale"),
                      na.action=na.exclude
    )
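
    As an alternative to removing the column by hand, dummyVars has a fullRank argument that drops one indicator per factor. A sketch, assuming the same mtcars setup as above (dv and onehot_fullrank are illustrative names, and a plain lm() is used here only to check the coefficients):

    ```r
    library(caret)

    da <- mtcars
    da$cyl <- factor(da$cyl)
    features <- c("cyl", "hp", "drat", "wt", "qsec")

    # fullRank = TRUE keeps k - 1 indicator columns for a factor with k levels
    dv <- dummyVars(mpg ~ ., data = da[, c("mpg", features)], fullRank = TRUE)
    onehot_fullrank <- data.frame(mpg = da$mpg, predict(dv, newdata = da))

    # with one level dropped, every coefficient can be estimated (no NA)
    fit <- lm(mpg ~ ., data = onehot_fullrank)
    coef(fit)
    ```

    The onehot_fullrank data frame can be passed to train(...) in the same way as onehot_data above.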