I am trying to build different regression models using the R package caret. The data includes both numerical values and factors.
Question 1: What is the correct way to include both numerical values and factors in a regression model in caret?
Question 2: As data preprocessing (center and scale) is usually required for a regression model, how does the preprocessing work for factors?
library(caret)
data("mtcars")
mydata = mtcars[, -c(8,9)]
set.seed(100)
mydata$dir = sample(x=c("N", "E", "S", "W"), size = 32, replace = T)
mydata$dir = as.factor(mydata$dir)
class(mydata$dir) # Factor with four levels
MyControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 2,
verboseIter = TRUE,
savePredictions = "final"
)
model_glm <- train(
hp ~ .,
data = mydata,
method = "glm",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = MyControl
)
model_pls <- train(
hp ~ .,
data = mydata,
method = "pls",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = MyControl
)
model_rf <- train(
hp ~ .,
data = mydata,
tuneLength = 5,
method = "ranger",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = MyControl
)
model_knn <- train(
hp ~ .,
data = mydata,
tuneLength = 5,
method = "knn",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = MyControl
)
model_svmr <- train(
hp ~ .,
data = mydata,
tuneLength = 5,
method = "svmRadial",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = MyControl
)
According to the documentation of the function train, the input x can be:
a data frame containing training data where samples are in rows and features are in columns.
And y, the target, can be:
A numeric or factor vector containing the outcome for each sample.
So you can use both numerical and factor variables. Using the ~ . notation, you will cover all variables in x. Preprocessing numerical variables with center and scale is indeed a good approach. In the train function you can use the preProcess argument to center and scale. According to the documentation, this ignores factor variables:
a matrix or data frame. Non-numeric predictors are allowed but will be ignored.
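To see what this means in practice, here is a minimal sketch (the toy data frame `toy` and its column names are made up for illustration). It calls `preProcess` directly on a data frame that holds a factor, then uses base R's `model.matrix` to show the numeric 0/1 indicator columns that a formula such as `y ~ .` generates from a factor. Assuming the formula method of `train` builds its predictor matrix this way, the indicator columns would already be numeric by the time centering and scaling are applied.

```r
library(caret)

# Hypothetical toy data: one numeric predictor and one factor
toy <- data.frame(
  y   = c(10, 12, 9, 15, 11, 14),
  num = c(1.0, 2.5, 3.0, 4.5, 5.0, 6.5),
  dir = factor(c("N", "E", "S", "W", "N", "E"))
)

# preProcess() estimates centering/scaling from the numeric columns only
pp <- preProcess(toy[, c("num", "dir")], method = c("center", "scale"))
toy_pp <- predict(pp, newdata = toy[, c("num", "dir")])

mean(toy_pp$num)       # numeric column after centering
sd(toy_pp$num)         # numeric column after scaling
class(toy_pp$dir)      # the factor column is still a factor

# A formula expands the factor into numeric indicator columns
mm <- model.matrix(y ~ ., data = toy)
colnames(mm)
```

If the indicator columns should stay as plain 0/1 values, one option is to preprocess only the numeric columns yourself before modelling.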
On page 550 of the book Applied Predictive Modeling (by Max Kuhn and Kjell Johnson), you can find which pre-processing methods are suggested for linear regression.
One option for handling factor variables in preprocessing is to use dummy variables. In caret you can use the function dummyVars to convert the factor variables into numeric indicator columns.
Example code:
library(caret)
library(tibble)
# Data
data("mtcars")
mydata = mtcars[, -c(8,9)]
set.seed(100)
mydata$dir = sample(x=c("N", "E", "S", "W"), size = 32, replace = T)
mydata$dir = as.factor(mydata$dir)
class(mydata$dir) # Factor with four levels
#> [1] "factor"
# Create dummy variables
dummy_mydata <- dummyVars(hp~., data = mydata)
dummy_mydata_updated <- as_tibble(predict(dummy_mydata, newdata = mydata))
# remember to include the outcome variable too
dummy_mydata_updated <- cbind(hp = mydata$hp, dummy_mydata_updated)
head(dummy_mydata_updated)
#> hp mpg cyl disp drat wt qsec gear carb dir.E dir.N dir.S dir.W
#> 1 110 21.0 6 160 3.90 2.620 16.46 4 4 1 0 0 0
#> 2 110 21.0 6 160 3.90 2.875 17.02 4 4 0 0 1 0
#> 3 93 22.8 4 108 3.85 2.320 18.61 4 1 1 0 0 0
#> 4 110 21.4 6 258 3.08 3.215 19.44 3 1 0 0 0 1
#> 5 175 18.7 8 360 3.15 3.440 17.02 3 2 0 0 1 0
#> 6 105 18.1 6 225 2.76 3.460 20.22 3 1 0 1 0 0
MyControl = trainControl(
method = "repeatedcv",
number = 5,
repeats = 2,
verboseIter = TRUE,
savePredictions = "final"
)
model_glm <- train(
hp ~ .,
data = mydata,
method = "glm",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = MyControl
)
plot(model_glm$pred$pred, model_glm$pred$obs,
xlab='Predicted Values',
ylab='Actual Values',
main='Predicted vs. Actual Values')
abline(a=0, b=1)
model_glm_dummy <- train(
hp ~ .,
data = dummy_mydata_updated,
method = "glm",
metric = "RMSE",
preProcess = c('center', 'scale'),
trControl = MyControl
)
plot(model_glm_dummy$pred$pred, model_glm_dummy$pred$obs,
xlab='Predicted Values',
ylab='Actual Values',
main='Predicted vs. Actual Values')
abline(a=0, b=1)
# Results
model_glm$results
#> parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
#> 1 none 37.13849 0.7302309 32.09739 11.50226 0.2143993 10.10452
model_glm_dummy$results
#> parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
#> 1 none 35.71861 0.8095385 29.678 8.959409 0.1193792 5.616908
Created on 2022-10-23 with reprex v2.0.2
As you can see, there is a slight difference in results between the two models.
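Part of that gap may come from the resampling rather than the encoding: the seed is not reset before each `train` call, so the two models are evaluated on different folds. Below is a minimal sketch of a like-for-like comparison. It assumes it is acceptable to reset the seed before each call and to request `fullRank = TRUE` from `dummyVars`, so that one level is dropped as the formula interface does with default contrasts. The seed value 123 and the object names `dat`, `ctrl`, `dv_full`, `dat_dummy`, `fit_factor` and `fit_dummy` are arbitrary choices for illustration.

```r
library(caret)

# Same data as above
data("mtcars")
dat <- mtcars[, -c(8, 9)]
set.seed(100)
dat$dir <- factor(sample(c("N", "E", "S", "W"), size = 32, replace = TRUE))

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# Full-rank dummy coding: the first level of dir is dropped
dv_full <- dummyVars(hp ~ ., data = dat, fullRank = TRUE)
dat_dummy <- data.frame(hp = dat$hp, predict(dv_full, newdata = dat))

# Reset the seed before each call so both models see the same folds
set.seed(123)
fit_factor <- train(hp ~ ., data = dat, method = "glm", metric = "RMSE",
                    preProcess = c("center", "scale"), trControl = ctrl)

set.seed(123)
fit_dummy <- train(hp ~ ., data = dat_dummy, method = "glm", metric = "RMSE",
                   preProcess = c("center", "scale"), trControl = ctrl)

fit_factor$results$RMSE
fit_dummy$results$RMSE
```

With identical folds and an equivalent encoding, the two RMSE values should agree closely, which would suggest that the difference reported above is mostly resampling noise.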
For a really useful resource, check this website.