
subscript out of bounds in gbm function


I am having a strange problem. I have successfully run this code on my laptop, but when I try to run it on another machine I first get the warning "Distribution not specified, assuming bernoulli ...", which I expect, and then this error: Error in object$var.levels[[i]] : subscript out of bounds

library(gbm)
gbm.tmp <- gbm(subxy$presence ~ btyme + stsmi + styma + bathy,
                data=subxy,
                var.monotone=rep(0, length= 4), n.trees=2000, interaction.depth=3,
                n.minobsinnode=10, shrinkage=0.01, bag.fraction=0.5, train.fraction=1,
                verbose=F, cv.folds=10)

Can anybody help? The data structures are exactly the same, same code, same R. I am not even using a subscript here.

EDIT: traceback()

6: predict.gbm(model, newdata = my.data, n.trees = best.iter.cv)
5: predict(model, newdata = my.data, n.trees = best.iter.cv)
4: predict(model, newdata = my.data, n.trees = best.iter.cv)
3: gbmCrossValPredictions(cv.models, cv.folds, cv.group, best.iter.cv, 
       distribution, data[i.train, ], y)
2: gbmCrossVal(cv.folds, nTrain, n.cores, class.stratify.cv, data, 
       x, y, offset, distribution, w, var.monotone, n.trees, interaction.depth, 
       n.minobsinnode, shrinkage, bag.fraction, var.names, response.name, 
       group)
1: gbm(subxy$presence ~ btyme + stsmi + styma + bathy, data = subxy,
       var.monotone = rep(0, length = 4), n.trees = 2000, interaction.depth = 3,
       n.minobsinnode = 10, shrinkage = 0.01, bag.fraction = 0.5, train.fraction = 1,
       verbose = F, cv.folds = 10)

Could it have something to do with the fact that I moved the saved R workspace to another machine?

EDIT 2: OK, so I have updated the gbm package on the machine where the code was working, and now I get the same error there. So at this point I am thinking that either the older gbm version did not have this check in place or the newer version has some problem. I don't understand gbm well enough to say.


Solution

  • Just a hunch since I can't see your data, but I believe that error occurs when you have variable levels that exist in the test set which don't exist in the training set.

    This can easily happen when you have a factor variable with a high number of levels, or when one level has a low number of instances.

    Since you're using CV folds, it's possible that the holdout set on one of the loops has levels that are foreign to the training data.
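
    One way to check whether this applies is to count the rows per factor level. This is only a minimal sketch on made-up data: the data frame `df` and the column `site` are hypothetical and not taken from the question.

    ```r
    # Hypothetical data: the factor level "d" occurs only once.
    df <- data.frame(
      y    = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
      site = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d"))
    )

    # Count rows per level for every factor column in the data frame.
    level_counts <- lapply(Filter(is.factor, df), table)
    print(level_counts)

    # A level that occurs only once is always missing from the training data
    # of the fold that holds it out, so list those levels per column.
    rare <- lapply(level_counts, function(tab) names(tab)[tab < 2])
    print(rare)
    ```

    Levels with only a handful of rows are also at risk, because all of their rows can land in the same holdout by chance.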

    I'd suggest either:

    A) use model.matrix() to one-hot encode your factor variables

    B) keep setting different seeds until you get a CV split that doesn't have this error occur.
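
    A rough sketch of option A, on made-up data (the data frame `df` and its columns are hypothetical). The idea is to encode the factor once, on the full data set and before any CV split, so that every fold ends up with the same numeric columns:

    ```r
    # Hypothetical data frame with one numeric and one factor predictor.
    df <- data.frame(
      y  = c(0, 1, 0, 1, 1, 0),
      x1 = c(0.5, -1.2, 0.3, 2.1, -0.7, 1.4),
      x2 = factor(c("a", "b", "c", "a", "b", "c"))
    )

    # With a single factor and no intercept ("- 1"), model.matrix() returns
    # one 0/1 indicator column per factor level.
    onehot <- model.matrix(~ x2 - 1, data = df)

    # Replace the factor column with its indicator columns.
    encoded <- data.frame(df[c("y", "x1")], onehot)
    print(encoded)
    ```

    The encoded data frame contains only numeric columns, so a holdout row can no longer carry a factor level the model has never seen. I have not run this against the data in the question, so treat it as a starting point.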

    EDIT: yep, with that traceback, the error is raised inside gbmCrossValPredictions, i.e. while predicting on a CV holdout. One of your holdouts has a factor level in its test set that doesn't exist in the training data, so the predict function sees a foreign value and doesn't know what to do.

    EDIT 2: Here's a quick example to show what I mean by "factor levels not in the training set".

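    # Example data with low occurrences of each factor level:
    
    set.seed(222)
    # cbind() builds a numeric matrix, so the factor is stored as its integer codes
    data <- data.frame(cbind(y  = sample(0:1, 10, replace = TRUE),
                             x1 = rnorm(10),
                             x2 = as.factor(sample(0:10, 10, replace = TRUE))))
    data$x2 <- as.factor(data$x2)  # turn the codes back into a factor
    data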
    
          y         x1 x2
     [1,] 1 -0.2468959  2
     [2,] 0 -1.2155609  6
     [3,] 0  1.5614051  1
     [4,] 0  0.4273102  5
     [5,] 1 -1.2010235  5
     [6,] 1  1.0524585  8
     [7,] 0 -1.3050636  6
     [8,] 0 -0.6926076  4
     [9,] 1  0.6026489  3
    [10,] 0 -0.1977531  7
    
    # CV fold. This splits the data so that a model is trained on 80% of the rows and
    # tested against the remaining 20%. It is a simpler version of what gbm's cv.folds does.
    
    CV_train_rows <- sample(1:10, 8, replace = FALSE)
    CV_test_rows <- setdiff(1:10, CV_train_rows)
    CV_train <- data[CV_train_rows, ]
    CV_test <- data[CV_test_rows, ]
    
    # Build a model on the training set...
    
    CV_model <- lm(y ~ ., data = CV_train)
    summary(CV_model)
    # Note: the model was only fed factor levels (3, 4, 5, 6, 7, 8) for variable x2.
    
    CV_test$x2
    # The test set contains only levels 1 and 2, which the model has never seen.
    
    # Attempt to predict on the test set
    predict(CV_model, CV_test)
    
    Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
    factor x2 has new levels 1, 2
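
    To see in advance which levels would trigger that error, you can compare the levels actually observed in each split before calling predict(). This is a minimal sketch using small made-up `train` and `test` data frames (hypothetical, not the data above). Dropping the affected rows is just one possible way to handle them.

    ```r
    # Hypothetical splits: level "1" appears only in the test set.
    train <- data.frame(y  = c(0, 1, 0, 1, 1, 0),
                        x2 = factor(c("3", "4", "5", "5", "6", "6")))
    test  <- data.frame(y  = c(1, 0),
                        x2 = factor(c("1", "5")))

    # Levels observed in the test rows but never in the training rows.
    unseen <- setdiff(unique(as.character(test$x2)),
                      unique(as.character(train$x2)))
    print(unseen)

    # Keep only the test rows whose level the model has seen, then predict.
    test_ok <- droplevels(test[!(test$x2 %in% unseen), , drop = FALSE])
    fit  <- lm(y ~ x2, data = train)
    pred <- predict(fit, newdata = test_ok)
    print(pred)
    ```

    If `unseen` is non-empty for a split, that split would hit the "new levels" error.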