r validation categorical-data dummy-variable

Missing Categories in Validation Data

I built a classification model in R on a training dataset with 12 categorical predictors, each of which holds tens to hundreds of categories.

The problem is that in the dataset I use for validation, some of the variables have fewer categories than in the training data.

For example, if variable v1 has 3 categories in the training data ('a', 'b', 'c'), it may have only 2 categories in the validation dataset ('a', 'b').

In tree-based methods like decision trees or random forests this causes no problem, but in logistic regression methods (I use LASSO) that require preparing a dummy variable matrix, the numbers of columns in the training data matrix and the validation data matrix don't match. Going back to the example of variable v1, in the training data I get three dummy variables for v1, and in the validation data I get only two.

Any idea how to solve this?

Solution

You can avoid this problem by setting the factor levels correctly. Look at the following toy example:

set.seed(106)
thedata <- data.frame(
  y = rnorm(100),
  x = factor(sample(letters[1:3],100,TRUE))
)
head(model.matrix(y~x, data = thedata))
thetrain <- thedata[1:7,]
length(unique(thetrain$x))
head(model.matrix(y~x, data = thetrain))

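I make a dataset with an x and a y variable, where x is a factor with 3 levels. The training dataset only contains 2 of the levels of x, but the model matrix is still constructed correctly. (With a different R version the random draw may differ, since the default sampling algorithm changed in R 3.6.0, but the point stands.) That is because R kept the level information of the original dataset: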

> levels(thetrain$x)
[1] "a" "b" "c"

The problem arises when your training set is constructed in a way that drops the level information of the factor, e.g. with data.frame() on the raw (character) values, or with any other method that discards the unused levels.
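To make that concrete, here is a small sketch of which common operations keep unused levels and which drop them. The factor f and the other object names are made up for illustration only:

```r
# Hypothetical toy factor; all object names here are illustrative.
f <- factor(c("a", "b", "c", "a", "c"))

kept           <- f[c(1, 3, 4)]               # plain subsetting keeps all levels
dropped_subset <- f[c(1, 3, 4), drop = TRUE]  # drop = TRUE removes unused levels
dropped_fn     <- droplevels(kept)            # droplevels() removes unused levels
dropped_chr    <- factor(as.character(kept))  # a round trip through character loses them too

levels(kept)
levels(dropped_subset)
levels(dropped_fn)
levels(dropped_chr)
```

Only the first of these still knows about "b"; the other three are left with just the levels that actually occur.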

Try the following:

thetrain$x <- factor(thetrain$x) # re-creating the factor drops the unused levels
levels(thetrain$x)
head(model.matrix(y~x, data = thetrain))

The output of the second line shows that the level "b" has been dropped, and consequently the model matrix is no longer what you want. So make sure that every factor in your training dataset actually carries all of its levels, e.g.:

thetrain$x <- factor(thetrain$x, levels = c("a","b","c"))
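For the situation in the question (a validation set with fewer categories than the training set), the same idea can be applied to all factor columns at once. This is only a sketch: the data frames train and valid and their columns are invented for illustration, and it assumes the validation set contains no category that is absent from the training set.

```r
# Hypothetical training and validation sets; all names are illustrative.
train <- data.frame(
  y  = c(1, 0, 1, 0, 1, 0),
  v1 = factor(c("a", "b", "c", "a", "b", "c")),
  v2 = factor(c("u", "v", "u", "v", "u", "v"))
)
valid <- data.frame(
  y  = c(0, 1, 1),
  v1 = factor(c("a", "b", "a")),  # level "c" never occurs here
  v2 = factor(c("v", "u", "v"))
)

# Re-declare every factor in the validation set with the training levels.
factor_cols <- names(train)[vapply(train, is.factor, logical(1))]
for (col in factor_cols) {
  valid[[col]] <- factor(valid[[col]], levels = levels(train[[col]]))
}

x_train <- model.matrix(y ~ ., data = train)
x_valid <- model.matrix(y ~ ., data = valid)

# Check that both matrices have the same columns in the same order.
identical(colnames(x_train), colnames(x_valid))
```

The dummy column for the missing category is simply all zeros in the validation matrix. Be aware that a validation value that is not among the training levels would become NA under this approach, so it is worth checking for NAs afterwards.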

On a side note: if you build your model matrices yourself using either model.frame() or model.matrix(), the argument xlev might be of help:

thetrain$x <- factor(thetrain$x) # re-creating the factor drops the unused levels
levels(thetrain$x)
head(model.matrix(y~x, data = thetrain,
                  xlev = list(x = c('a','b','c'))))

Note that xlev is really an argument of model.frame(), and model.matrix() only passes it on when it has to call model.frame() itself, which is not in every case. So this solution is not guaranteed to always work, but it should for data frames.
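One possible way around that caveat is to call model.frame() yourself with xlev and then hand the resulting model frame to model.matrix(). The sketch below takes the levels from a training set instead of typing them by hand; train, valid and train_levels are hypothetical names used only for illustration:

```r
# Hypothetical data; all object names are illustrative.
train <- data.frame(
  y  = c(1, 0, 1, 0),
  v1 = factor(c("a", "b", "c", "a"))
)
valid <- data.frame(
  y  = c(0, 1),
  v1 = factor(c("a", "b"))  # level "c" is missing
)

# Named list of the levels of every factor column in the training set.
is_fac <- vapply(train, is.factor, logical(1))
train_levels <- lapply(train[is_fac], levels)

# Build the model frame explicitly so that xlev is certainly applied,
# then construct the design matrix from that model frame.
mf_valid <- model.frame(y ~ v1, data = valid, xlev = train_levels)
x_valid  <- model.matrix(y ~ v1, data = mf_valid)

colnames(x_valid)
```

As far as I can tell, model.frame() raises an error when the data contain a level that is not listed in xlev, so unseen categories in the validation set are at least not silently ignored.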