I have a data frame called b. I split this into a training set and a test set.
smp_size <- floor(0.75 * nrow(b))
set.seed(123)
train_ind <- sample(seq_len(nrow(b)), size = smp_size)
b_train <- b[train_ind, ]
b_test <- b[-train_ind, ]
b contains a variable/column, let's say x, that I use as factor() with many different categories.
I use b_train to fit a linear model with lm(). After that I call predict() with the lm() object and b_test. Unfortunately, b_train$x does not include all of the categories that occur in b$x. Therefore predict() cannot be used, since b_test$x contains categories that are not in b_train$x.
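A minimal sketch that reproduces the situation described above. The toy data frame and the fixed train_ind are assumptions made for illustration: the split is forced by hand instead of using sample(), so that the failing case appears deterministically.

```r
# Toy data (hypothetical): "c" is rare and "d" occurs only once.
b <- data.frame(
  x = c(rep("a", 10), rep("b", 6), rep("c", 3), "d"),
  y = seq_len(20),
  stringsAsFactors = FALSE
)

# Force the problematic case: rows 16-20 (one "b", all "c", the only "d")
# go to the test set.
train_ind <- 1:15
b_train <- b[train_ind, ]
b_test  <- b[-train_ind, ]

# Categories present in the test set but never seen during training.
missing_levels <- setdiff(unique(b_test$x), unique(b_train$x))
missing_levels

fit <- lm(y ~ factor(x), data = b_train)

# predict() stops with an error because factor(x) has new levels in b_test;
# try() captures the error so the script keeps running.
res <- try(predict(fit, newdata = b_test), silent = TRUE)
inherits(res, "try-error")
```

Running setdiff() like this on a real split is a quick way to check which categories are missing from the training set before calling predict().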
How can I make sure that all categories are included in b_train$x?
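This can be done with the createDataPartition() function from the caret package. It samples within each level of the factor rather than across the whole data frame, so every category that occurs in b$x should be represented in the training set.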
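library(caret)
set.seed(123)  # make the split reproducible
# Stratified sample of row indices: about 75% of the rows from each level of x
samp <- createDataPartition(as.factor(b$x), p = 0.75, list = FALSE)
train <- b[samp, ]
test <- b[-samp, ]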
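If adding caret as a dependency is not an option, roughly the same idea can be sketched in base R: sample within each level of x and always keep at least one row per level. This is only a sketch and not caret's implementation; the toy data frame and the helper name stratified_split are assumptions for illustration.

```r
# Toy data (hypothetical): level "d" has a single row.
set.seed(123)
b <- data.frame(
  x = factor(c(rep("a", 10), rep("b", 6), rep("c", 3), "d")),
  y = rnorm(20)
)

# Hypothetical helper: pick a fraction p of the row indices within each level,
# keeping at least one row per level.
stratified_split <- function(f, p = 0.75) {
  idx_by_level <- split(seq_along(f), f, drop = TRUE)
  picked <- lapply(idx_by_level, function(idx) {
    n <- max(1, ceiling(length(idx) * p))
    # Index via sample.int() so a level with a single row is handled safely.
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_ind <- stratified_split(b$x, p = 0.75)
b_train <- b[train_ind, ]
b_test  <- b[-train_ind, ]

# Every level seen in the test set should also occur in the training set.
all(unique(b_test$x) %in% unique(b_train$x))

# With full coverage, predict() no longer meets unseen levels.
fit  <- lm(y ~ x, data = b_train)
pred <- predict(fit, newdata = b_test)
```

Note that a level with only one row ends up entirely in the training set, so the model sees it but the test set cannot evaluate it.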