I have a data frame with a bunch of factor variables that need to be converted to dummy variables for use with the xgboost package. I'm currently using the dummyVars function in caret, which is pretty good but kind of slow. Is there a faster way to do this conversion?
model.matrix (in the stats package, which ships with base R) and sparse.model.matrix (from the Matrix package) both do the job, and I have always found them quite fast. For example:
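library(Matrix)  # provides sparse.model.matrix; model.matrix is in stats
oat_data <- data.frame(num = c(1, 2, 4, 8, 16),
                       animal = c("cat", "cat", "dog", "cat", "horse"),
                       oats = c("likes", "dislikes", "dislikes", "likes", "dislikes"),
                       stringsAsFactors = TRUE)  # not the default since R 4.0
# ~ . - 1 uses every column as a predictor and drops the intercept
dense_mat <- model.matrix(~ . - 1, data = oat_data)
sparse_mat <- sparse.model.matrix(~ . - 1, data = oat_data, verbose = FALSE)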
dense_mat
  num animalcat animaldog animalhorse oatslikes
1   1         1         0           0         1
2   2         1         0           0         0
3   4         0         1           0         0
4   8         1         0           0         1
5  16         0         0           1         0
sparse_mat
  num animalcat animaldog animalhorse oatslikes
1   1         1         .           .         1
2   2         1         .           .         .
3   4         .         1           .         .
4   8         1         .           .         1
5  16         .         .           1         .
Both are very fast, even with hundreds of variables that each have many levels.
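One thing to watch for in the output above: with `~ . - 1`, only the first factor (animal) gets a column for every level, while oats is reduced to a single oatslikes column. If you want a column for every level of every factor, as dummyVars produces by default, a minimal sketch is to pass unreduced contrasts through `contrasts.arg`. The names `factor_cols`, `full_contrasts` and `one_hot` are just illustrative. `sparse.model.matrix` also has a `contrasts.arg` argument, so the same list should carry over, but I have only shown the base `model.matrix` version here.

```r
oat_data <- data.frame(
  num    = c(1, 2, 4, 8, 16),
  animal = c("cat", "cat", "dog", "cat", "horse"),
  oats   = c("likes", "dislikes", "dislikes", "likes", "dislikes"),
  stringsAsFactors = TRUE
)

# Names of the factor columns in the data frame
factor_cols <- names(oat_data)[sapply(oat_data, is.factor)]

# contrasts(x, contrasts = FALSE) returns an identity matrix over the levels,
# i.e. one indicator column per level with nothing dropped
full_contrasts <- lapply(oat_data[factor_cols], contrasts, contrasts = FALSE)

# Same formula as before, but every factor now keeps all of its levels
one_hot <- model.matrix(~ . - 1, data = oat_data,
                        contrasts.arg = full_contrasts)
colnames(one_hot)
```

For a tree-based model like xgboost the redundant column is harmless, but for a linear model it makes the design matrix rank-deficient, so keep the default contrasts there.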