Splitting data into training and test sets: how to make sure all factor levels are included in the training set?


I have a data frame called b, which I split into a training set and a test set:

smp_size <- floor(0.75 * nrow(b))
set.seed(123)
train_ind <- sample(seq_len(nrow(b)), size = smp_size)
b_train <- b[train_ind, ]
b_test <- b[-train_ind, ]

b contains a column, say x, that I treat as a factor() with many levels.

I fit a linear model on b_train with lm() and then call predict() on the resulting lm object with b_test as the new data. Unfortunately, b_train$x does not contain every level that occurs in b$x. predict() therefore fails, because b_test$x contains levels that the model never saw in b_train$x.
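
A quick way to confirm the mismatch before calling predict() is to compare the levels present in each split. This is a minimal sketch on a toy data frame. The helper name unseen_levels and the hand-picked indices are hypothetical and only there to make the example deterministic:

```r
# Toy stand-in for b: level "d" occurs only once
b <- data.frame(
  x = factor(c("a", "a", "b", "b", "c", "c", "c", "d")),
  y = c(1.0, 1.2, 2.1, 1.9, 3.2, 2.8, 3.0, 4.1)
)

# Hand-picked split (instead of sample()) so the outcome is reproducible:
# row 8, the only "d" row, lands in the test set
train_ind <- 1:6
b_train <- b[train_ind, ]
b_test <- b[-train_ind, ]

# Hypothetical helper: levels present in `test` but absent from `train`
unseen_levels <- function(train, test, col) {
  setdiff(unique(as.character(test[[col]])),
          unique(as.character(train[[col]])))
}

not_in_train <- unseen_levels(b_train, b_test, "x")
print(not_in_train)
```

A non-empty result means predict() will meet levels that the model was not fitted on.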

How can I make sure that every level of x is represented in b_train$x?


Solution

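  • The caret package's createDataPartition() function does a stratified split: it samples within each level of the factor it is given, so every level that has at least one row should end up in the training set.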

    library(caret)

    # Stratify on x so that sampling happens within each factor level
    set.seed(123)
    samp <- createDataPartition(as.factor(b$x), p = 0.75, list = FALSE)

    train <- b[samp, ]
    test <- b[-samp, ]
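
If adding caret is not an option, the same idea can be sketched in base R: sample row indices separately within each level of x, and always keep at least one row per level in the training set. This is an illustrative sketch, not caret's implementation. The function name stratified_train_index and the toy data frame are hypothetical.

```r
# Hypothetical helper: sample a fraction p of the rows within each level of x,
# always keeping at least one row per level for the training set
stratified_train_index <- function(x, p = 0.75) {
  idx_by_level <- split(seq_along(x), as.character(x))
  picked <- lapply(idx_by_level, function(idx) {
    n_keep <- max(1, floor(p * length(idx)))
    # Index into idx rather than calling sample(idx, ...) directly,
    # because sample() treats a single number n as 1:n
    idx[sample.int(length(idx), n_keep)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Toy stand-in for b; level "d" has a single row
b <- data.frame(
  x = factor(c(rep("a", 8), rep("b", 6), rep("c", 5), "d")),
  y = seq_len(20)
)

set.seed(123)
train_ind <- stratified_train_index(b$x, p = 0.75)
b_train <- b[train_ind, ]
b_test <- b[-train_ind, ]

# Every level in the test set should also occur in the training set
stopifnot(all(unique(as.character(b_test$x)) %in%
              unique(as.character(b_train$x))))
```

A level with a single row ends up only in the training set, so the test set cannot be used to evaluate predictions for it. Rows where x is NA are not sampled by this helper and therefore fall into the test set.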