Tags: r, dummy-variable, one-hot-encoding

R dummy/one-hot encoding with fixed column structure


Assume my machine learning training dataset contains 3 categorical columns with up to 50 different levels. I one-hot encode these columns. The test dataset has only one row. How can I keep the column structure of the training dataset when I encode the test dataset?

Everything works fine for the training data ...

v1 <- factor(c("a", "b", "c", "a"))
v2 <- factor(c("A", "A", "B", "C"))
train <- data.frame(v1 = v1, v2 = v2)
train_dummy <- as.data.frame(model.matrix(~ v1 + v2 - 1, data = train,
    contrasts.arg = list(v1 = contrasts(train$v1, contrasts = FALSE),
                         v2 = contrasts(train$v2, contrasts = FALSE))))
print(train)
v1 v2
a  A
b  A
c  B
a  C

print(train_dummy )
v1a v1b v1c v2A v2B v2C
1   0   0   1   0   0
0   1   0   1   0   0
0   0   1   0   1   0
1   0   0   0   0   1

...but for the test data it fails. When I try to apply the training data's factor levels to the test data it does not work:

test <- data.frame(v1 = factor("a"), v2 = factor("A"))
test_dummy <- as.data.frame(model.matrix(~ v1 + v2 - 1, data = test,
    contrasts.arg = list(v1 = contrasts(train$v1, contrasts = FALSE),
                         v2 = contrasts(train$v2, contrasts = FALSE))))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Of course I can row-bind the train and test data and then do the dummy encoding, but this is production code and I cannot accept this as the only solution:

train_test <- rbind(train, test)
train_test_dummy <- as.data.frame(model.matrix(~ v1 + v2 - 1, data = train_test,
    contrasts.arg = list(v1 = contrasts(train_test$v1, contrasts = FALSE),
                         v2 = contrasts(train_test$v2, contrasts = FALSE))))

print(train_test_dummy)
v1a v1b v1c v2A v2B v2C
1   0   0   1   0   0
0   1   0   1   0   0
0   0   1   0   1   0
1   0   0   0   0   1
1   0   0   1   0   0

Is there anything better?

This is a duplicate, but the earlier question was not answered, and all the other questions only address generating dummy variables from a single dataset.


Solution

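  • Before calling model.matrix on the test data, rebuild the test factors with the levels of the training data. Use factor() for this so that values are matched by label; assigning with levels(test$v1) <- levels(train$v1) renames the existing levels by position and only gives the right result when the test value happens to be the first training level:

    test$v1 <- factor(test$v1, levels = levels(train$v1))
    test$v2 <- factor(test$v2, levels = levels(train$v2))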
    

    or, in one line if all the columns are factors and test and train have the same columns in the same order,

    test[] <- Map(function(x, y) factor(x, levels = levels(y)), test, train)
    

    and if only some of them are factors,

    test[] <- Map(function(x, y) if (is.factor(x)) factor(x, levels = levels(y)) else x, test, train)
    

    Re-running the model.matrix call from the question on test then gives the required structure:

    test_dummy
    #   v1a v1b v1c v2A v2B v2C
    # 1   1   0   0   1   0   0
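
The steps above can be put together in one runnable sketch. The helper name one_hot_like is hypothetical (it is not part of base R), and the sketch assumes that the new data has the same columns, in the same order, as the training data, and that the factor columns are v1 and v2 as in the question. The test row deliberately uses values that are not the first level of either factor, to show that the values are matched by label.

```r
# Training data from the question.
train <- data.frame(v1 = factor(c("a", "b", "c", "a")),
                    v2 = factor(c("A", "A", "B", "C")))

# Hypothetical helper (not part of base R): one-hot encode `newdata`
# with the factor levels, and therefore the columns, of `ref`.
# Assumes both data frames have the same columns in the same order.
one_hot_like <- function(newdata, ref) {
  # Rebuild every factor column with the reference levels, matched by label.
  newdata[] <- Map(function(x, y) if (is.factor(y)) factor(x, levels = levels(y)) else x,
                   newdata, ref)
  # Identity contrasts keep one indicator column per level.
  ctr <- list(v1 = contrasts(newdata$v1, contrasts = FALSE),
              v2 = contrasts(newdata$v2, contrasts = FALSE))
  as.data.frame(model.matrix(~ v1 + v2 - 1, data = newdata, contrasts.arg = ctr))
}

train_dummy <- one_hot_like(train, train)

# A single test row whose values are not the first level of either factor.
test <- data.frame(v1 = factor("b"), v2 = factor("C"))
test_dummy <- one_hot_like(test, train)

# Same columns, in the same order, as the training encoding.
stopifnot(identical(names(test_dummy), names(train_dummy)))
print(test_dummy)
#   v1a v1b v1c v2A v2B v2C
# 1   0   1   0   0   0   1

# For contrast: assigning levels renames by position instead of matching labels.
bad <- factor("b")
levels(bad) <- levels(train$v1)
as.character(bad)
# [1] "a"
```

In production code the training levels would typically be stored once (for example as a named list of character vectors) and reused for every incoming row, so the training data itself does not have to be kept around; that storage step is left out of this sketch.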