Tags: r, sparse-matrix, random-forest, xgboost

xgboost Random Forest with sparse matrix data and multinomial Y


I'm not sure whether xgboost's many nice features can be combined in the way that I need, but what I'm trying to do is run a Random Forest with sparse-matrix predictors on a multi-class dependent variable.

I know that xgboost can do each of these things individually:

  • Random Forest via tweaking xgboost parameters:

    bst <- xgboost(data = train$data, label = train$label, max.depth = 4,
                   num_parallel_tree = 1000, subsample = 0.5,
                   colsample_bytree = 0.5, nround = 1,
                   objective = "binary:logistic")
    
  • Sparse matrix predictors

    bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
                   eta = 1, nthread = 2,
                   nround = 10, objective = "binary:logistic")
    
  • Multinomial (multiclass) dependent variable models via multi:softmax or multi:softprob

    xgboost(data = data, label = multinomial_vector, max.depth = 4,
            eta = 1, nthread = 2, nround = 10, objective = "multi:softmax",
            num_class = length(unique(multinomial_vector)))  # multi:softmax also requires num_class
    

However, I run into an error regarding non-conforming length when I try to do all of them at once:

sparse_matrix     <- sparse.model.matrix(TripType ~ . - 1, data = train)
Y                 <- train$TripType
bst               <- xgboost(data = sparse_matrix, label = Y, max.depth = 4,
                             num_parallel_tree = 100, subsample = 0.5,
                             colsample_bytree = 0.5, nround = 1,
                             objective = "multi:softmax")
Error in xgb.setinfo(dmat, names(p), p[[1]]) : 
  The length of labels must equal to the number of rows in the input data
length(Y)
[1] 647054
length(sparse_matrix)
[1] 66210988200
nrow(sparse_matrix)
[1] 642925

The length error I'm getting compares the length of my multi-class dependent vector (call it n) to the number of rows in the sparse matrix; the length() of the sparse matrix itself is j * n for j predictors, which is why that number above is so large.
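
A minimal diagnostic, assuming the Y and sparse_matrix objects from the session above:

length(Y)                         # 647054: one label per row of train
nrow(sparse_matrix)               # 642925: the model matrix has fewer rows
length(Y) == nrow(sparse_matrix)  # FALSE: this is the check that fails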

The specific use case here is the Kaggle.com Walmart competition (the data is public, but large: about 650,000 rows and several thousand candidate features). I've been running multinomial RF models on it via H2O, but it sounds like a lot of other folks have been using xgboost, so I wonder whether this is possible.

If it's not possible, then I wonder whether one could/should estimate each level of the dependent variable separately and try to combine the results.
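
For concreteness, here is a minimal sketch of that one-vs-rest idea: one binary:logistic model per class, combined by taking the class with the highest predicted probability. It is illustrative only, and it assumes the labels and rows are properly aligned (per the solution below, Y and sparse_matrix currently are not):

library(xgboost)

classes <- sort(unique(as.character(Y)))
probs <- sapply(classes, function(k) {
  # one binary model per class: label is 1 for class k, 0 otherwise
  bst_k <- xgboost(data = sparse_matrix, label = as.numeric(Y == k),
                   max.depth = 4, nround = 10,
                   objective = "binary:logistic", verbose = 0)
  predict(bst_k, sparse_matrix)
})
pred <- classes[max.col(probs)]  # predicted class = highest one-vs-rest score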


Solution

  • Here is what is happening:

    When you do this:

    sparse_matrix <- sparse.model.matrix(TripType ~ . - 1, data = train)

    you are losing rows from your data.

    sparse.model.matrix cannot deal with NAs by default: when it sees one, it drops the row.

    As it happens, there are exactly 4129 rows that contain NAs in the original data.

    This is the difference between these two numbers:

    length(Y)
    [1] 647054
    
    nrow(sparse_matrix)
    [1] 642925
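
    You can confirm the dropped-row count directly (a minimal check, assuming the NAs sit in the predictor columns, so that complete.cases flags them):

    sum(!complete.cases(train))  # 4129: rows containing at least one NA
    647054 - 642925              # 4129: exactly the rows that disappeared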
    

    The reason the previous examples work is as follows:

    In the binomial case:

    xgboost is recycling the Y vector to fill in the missing labels. (This is BAD.)

    In the random forest case:

    (I think) it's because a random forest never uses the predictions from previous trees, so the error goes unseen. (This is also BAD.)

    Takeaway:

    Neither of the previous examples that "work" will train well.

    Because sparse.model.matrix drops NAs, you are losing rows from your training data; this is a big problem that needs to be addressed (two possible fixes are sketched below).
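
    A minimal sketch of two possible fixes. Option 1 is hedged: it assumes sparse.model.matrix honors the global na.action the way model.matrix does, and that you are happy to let xgboost treat NA as "missing" natively. Option 2 simply keeps Y and the matrix row-aligned.

    # Option 1: keep the NA rows in the model matrix via na.pass
    prev_na_action <- options(na.action = "na.pass")
    sparse_matrix  <- sparse.model.matrix(TripType ~ . - 1, data = train)
    options(prev_na_action)  # restore the previous na.action

    # Option 2: drop the same rows from Y that sparse.model.matrix drops
    keep          <- complete.cases(train)
    sparse_matrix <- sparse.model.matrix(TripType ~ . - 1, data = train[keep, ])
    Y             <- train$TripType[keep]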

    Good luck!