GBM error in classification bernoulli distribution


When running the gbm function for a classification problem, I get the following error:

Error in res[flag, ] <- predictions : replacement has length zero

I would like to know why I get this error and how to solve it.

My data consist of about 77 numeric variables (integers) to be used in the classification, plus one grouping factor. There are no other variables and no missing values. The grouping factor is coded as a factor (0, 1), as required.

The structure of my data looks something like this:

$Group : Factor w/ 2 levels "0", "1"
$it1 : int
...
$it70 : int

My model looks like this:

mod_gbm <- gbm(Group ~ ., distribution = "bernoulli", data = df,
               n.trees = 1000, shrinkage = .01, n.minobsinnode = 5,
               interaction.depth = 6, cv.folds = 5)

I realize this question is very similar to "Problems in using GBM function to do classification in R", but that person was asking about a numeric variable, and the only response was to remove cv.folds. I would like to keep cv.folds in my model and still have it run.


Solution

  • The help page for gbm (?gbm) describes the distribution argument as follows:

    distribution: Either a character string specifying the name of the
              distribution to use or a list with a component ‘name’
              specifying the distribution and any additional parameters
              needed. If not specified, ‘gbm’ will try to guess: if the
              response has only 2 unique values, bernoulli is assumed;
              otherwise, if the response is a factor, multinomial is
              assumed
    

    With only two classes, the response does not need to be a factor: for distribution = "bernoulli", gbm expects a numeric 0/1 response. We can explore this with the iris data, where I create a 0/1 group label:

    library(gbm)
    df = iris
    df$Group = factor(as.numeric(df$Species=="versicolor"))
    df$Species = NULL
     
    mod_gbm <- gbm(Group~.,distribution ="bernoulli", data=df,cv.folds=5)
    Error in res[flag, ] <- predictions : replacement has length zero
    

    I get the same error. Converting the response to numeric 0/1 makes the model run correctly.

    When the variable is a factor, as.numeric() returns the level codes 1, 2, with 1 corresponding to the first level. In this case the levels of Group are "0" and "1", so subtracting 1 recovers the original 0/1 coding:

    df$Group = as.numeric(df$Group)-1
    mod_gbm <- gbm(Group~.,distribution ="bernoulli", data=df,cv.folds=5)
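
    As a small base-R illustration of that conversion, compare the level codes with the labels. The vectors group and other are made up for this example and are not taken from the question's data. Subtracting 1 from the level codes is only safe when the levels are exactly "0" and "1"; going through as.character() converts the labels themselves:

    ```r
    # A two-level factor shaped like the Group column in the question
    group <- factor(c(0, 1, 1, 0, 1))

    as.numeric(group)                 # level codes: 1 2 2 1 2
    as.numeric(group) - 1             # 0 1 1 0 1, correct only because the levels are "0" and "1"
    as.numeric(as.character(group))   # 0 1 1 0 1, converts the labels themselves

    # With other labels, the level codes no longer match the labels
    other <- factor(c(5, 10, 10, 5))
    as.numeric(other)                 # level codes: 1 2 2 1
    as.numeric(as.character(other))   # labels: 5 10 10 5
    ```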
    

    And we get the predictions:

    pred = ifelse(predict(mod_gbm,type="response")>0.5,1,0)
    table(pred,df$Group)
    
        
    pred  0  1
       0 98  3
       1  2 47
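
    Since the goal is to keep cv.folds, a possible next step is to use the cross-validation results to choose the number of trees. This is a minimal sketch rather than a tuned model: it repeats the iris setup with a numeric 0/1 response, and the n.trees and shrinkage values are arbitrary choices for illustration. gbm.perf() with method = "cv" returns the iteration with the lowest cross-validated error, which can then be passed to predict():

    ```r
    library(gbm)

    df <- iris
    df$Group <- as.numeric(df$Species == "versicolor")  # numeric 0/1 response
    df$Species <- NULL

    # Illustrative settings only; not tuned for this data
    mod_gbm <- gbm(Group ~ ., distribution = "bernoulli", data = df,
                   n.trees = 500, shrinkage = 0.01, cv.folds = 5)

    # Iteration that minimises the cross-validated error
    best_iter <- gbm.perf(mod_gbm, method = "cv", plot.it = FALSE)

    # Predicted probabilities at that iteration, thresholded at 0.5
    prob <- predict(mod_gbm, newdata = df, n.trees = best_iter, type = "response")
    pred <- as.integer(prob > 0.5)
    table(pred, df$Group)
    ```

    The fold assignment is random, so the selected iteration and the resulting table will vary from run to run.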