r, sparse-matrix, xgboost, dummy-variable

Fastest dummy variable conversion package / function


I have a data frame with a bunch of factor variables that need to be converted to dummy variables for use with the xgboost package. I'm currently using the dummyVars function in caret, which works well but is kind of slow. Is there a faster way to do this conversion?


Solution

  • model.matrix (from the stats package that ships with base R) and sparse.model.matrix (from the Matrix package) both do the job, and I have always found them quite fast. For example:

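    library(Matrix)  # provides sparse.model.matrix; model.matrix comes from stats

    # stringsAsFactors = TRUE keeps the text columns as factors (no longer the default since R 4.0)
    oat_data <- data.frame(num    = c(1, 2, 4, 8, 16),
                           animal = c("cat", "cat", "dog", "cat", "horse"),
                           oats   = c("likes", "dislikes", "dislikes", "likes", "dislikes"),
                           stringsAsFactors = TRUE)

    # ~ . - 1 uses every column as a predictor and drops the intercept
    dense_mat  <- model.matrix(~ . - 1, data = oat_data)
    sparse_mat <- sparse.model.matrix(~ . - 1, data = oat_data)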
    
    dense_mat
      num animalcat animaldog animalhorse oatslikes
    1   1         1         0           0         1
    2   2         1         0           0         0
    3   4         0         1           0         0
    4   8         1         0           0         1
    5  16         0         0           1         0
    
    sparse_mat
      num animalcat animaldog animalhorse oatslikes
    1   1         1         .           .         1
    2   2         1         .           .         .
    3   4         .         1           .         .
    4   8         1         .           .         1
    5  16         .         .           1         .
    

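    Both are very fast, even with hundreds of factor variables that have many levels each. Note that with the intercept removed, only the first factor (animal) keeps a column for every level; later factors (oats) still drop their reference level.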
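
To check the speed claim on your own data shape, a rough timing sketch along these lines may help. The sizes (n_rows, n_factors, n_levels) and the column names f1, f2, ... are made up for illustration, and the timings will depend on your machine, so no numbers are quoted here.

```r
library(Matrix)

set.seed(42)
n_rows    <- 5000  # hypothetical number of rows
n_factors <- 10    # hypothetical number of factor columns
n_levels  <- 20    # hypothetical number of levels per factor

# Synthetic data: one numeric column plus n_factors factor columns.
lvl_names <- paste0("lvl", seq_len(n_levels))
big_data  <- data.frame(num = rnorm(n_rows))
for (j in seq_len(n_factors)) {
  big_data[[paste0("f", j)]] <- factor(
    sample(lvl_names, n_rows, replace = TRUE),
    levels = lvl_names
  )
}

# Time the dense and the sparse conversion with the same formula.
dense_time  <- system.time(dense_big  <- model.matrix(~ . - 1, data = big_data))
sparse_time <- system.time(sparse_big <- sparse.model.matrix(~ . - 1, data = big_data))

# Elapsed seconds for each conversion.
print(dense_time[["elapsed"]])
print(sparse_time[["elapsed"]])

# Both matrices have the same shape; the sparse one stores only non-zero entries.
print(dim(dense_big))
print(dim(sparse_big))
print(format(object.size(dense_big),  units = "MB"))
print(format(object.size(sparse_big), units = "MB"))
```

With many levels per factor most entries are zero, so the sparse matrix is usually the better fit for memory. As far as I know, xgboost's xgb.DMatrix accepts a dgCMatrix such as sparse_big directly, so there should be no need to convert it back to a dense matrix first.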