
Boosted regression trees - deviance values


I am fitting a BRT model using the gbm package in R for the following model:

height above ground ~ Age + season + habitat + timeofday

The height above ground is a continuous variable, and so is timeofday. Season and habitat are binary variables.

I get a very high deviance and I don't know why... Can somebody help me with the parameters?

> M1 <- gbm.step(data=data, gbm.x = 2:5, gbm.y = 1,
+                family = "gaussian", tree.complexity = 4,
+                learning.rate = 0.01, bag.fraction = 0.50,
+                tolerance.method = "fixed",
+                tolerance = 0.01)


 GBM STEP - version 2.9 

Performing cross-validation optimisation of a boosted regression tree model 
for HAG and using a family of gaussian 
Using 15439 observations and 4 predictors 
creating 10 initial models of 50 trees 

 folds are unstratified 
total mean deviance =  55368.22 
tolerance is fixed at  0.01 
ntrees resid. dev. 
50    51050.65 
now adding trees... 
100   48935.65 
150   47805.14 
200   47193.43 
250   46841.71 
300   46631.33 
350   46498.56 
400   46418.58 
450   46371.7 
500   46336.54 
550   46317.53 
600   46309.25 
650   46300.57 
700   46296.82 
750   46297 
800   46299.11 
850   46297.7 
900   46298.34 
950   46292.32 
1000   46297.62 
1050   46295.78 
1100   46301.32 
1150   46306.59 
1200   46312.55 
1250   46314.67 
1300   46318.64 
1350   46321.38 
1400   46324.33 
1450   46322.9 
fitting final gbm model with a fixed number of 950 trees for HAG

mean total deviance = 55368.21 
mean residual deviance = 45913.34 

estimated cv deviance = 46292.32 ; se = 1366.501 

training data correlation = 0.413 
cv correlation =  0.406 ; se = 0.008 

elapsed time -  0.02 minutes 

Solution

  • With family = "gaussian", the deviance reported by gbm.step is the mean squared error, so its magnitude depends on the scale of your dependent variable. A large value on its own does not mean the model is bad.

    For example:

    library(dismo)
    library(mlbench)
    data(BostonHousing)
    idx=sample(nrow(BostonHousing),400)
    TrnData = BostonHousing[idx,]
    TestData = BostonHousing[-idx,]
    

    The dependent variable is the last column, "medv", so we run a gbm on the raw data:

    gbm_0 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
    
    mean total deviance = 84.02 
    mean residual deviance = 7.871 
    
    estimated cv deviance = 13.959 ; se = 1.909 
    
    training data correlation = 0.952 
    cv correlation =  0.916 ; se = 0.012 
    

    You can see that the mean residual deviance can also be calculated from the residuals (which are y - y predicted):

    mean(gbm_0$residuals^2)
    [1] 7.871158
    
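Applying the same reading to the output in the question gives a rough sense of scale. The sketch below only reuses two numbers copied from that output; the variable names are made up for illustration. Taking the square root puts the error back in the units of HAG, and the ratio of the two deviances gives the fraction of deviance explained under cross-validation.

```r
# Values copied from the gbm.step output in the question
total_dev <- 55368.22  # total mean deviance: MSE when predicting the mean of HAG
cv_dev    <- 46292.32  # estimated cv deviance: cross-validated MSE of the model

# Square roots are in the same units as the response
rmse_null <- sqrt(total_dev)
rmse_cv   <- sqrt(cv_dev)

# Fraction of the total deviance explained under cross-validation
explained <- 1 - cv_dev / total_dev

print(round(c(rmse_null = rmse_null, rmse_cv = rmse_cv, explained = explained), 3))
```

The two square roots come out at roughly 235 and 215, and the explained fraction at about 0.16, which is in line with the squared cv correlation (0.406^2). So the deviance is large mainly because HAG varies over a wide range; the modest explained fraction is a separate issue.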

    It is always good to evaluate the model on TestData (which it has not been trained on). You can check how close the predictions are to the actual values using either the correlation or the MAE (mean absolute error):

    pred = predict(gbm_0,TestData,1000)    
    # or pearson if you like
    cor(pred,TestData$medv,method="spearman")
    [1] 0.8652737
    # MAE
    mean(abs(TestData$medv-pred))
    [1] 2.75325
    
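If you compute these checks repeatedly, it may help to wrap them in a small function. This is a minimal sketch in base R; `eval_metrics` is a hypothetical helper, not part of gbm or dismo, and the toy vectors stand in for TestData$medv and pred.

```r
# Hypothetical helper: summarise prediction error on held-out data
eval_metrics <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  err <- obs - pred
  c(rmse     = sqrt(mean(err^2)),   # same units as the response
    mae      = mean(abs(err)),      # same units as the response
    spearman = cor(obs, pred, method = "spearman"))  # unit free
}

# Toy vectors standing in for TestData$medv and pred
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)
m <- eval_metrics(obs, pred)
print(m)
```

RMSE and MAE are reported in the units of the response, so they are easier to interpret than the raw deviance.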

    Plotting the predicted values against the observed ones tells the same story: the correlation is good, and the predictions are off by about 3 on average.


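    Now if you change the scale of your dependent variable, the deviance changes, but the interpretation from the correlation or the MAE stays the same. Doubling medv multiplies the deviance by about 4 (squared units) and the MAE by about 2, while the correlation is essentially unchanged: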

    TrnData$medv = TrnData$medv*2
    TestData$medv = TestData$medv*2
    gbm_2 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
    
    mean total deviance = 336.081 
    mean residual deviance = 30.983 
    
    estimated cv deviance = 57.52 ; se = 10.254 
    
    training data correlation = 0.953 
    cv correlation =  0.911 ; se = 0.019 
    
    elapsed time -  0.2 minutes
    
    pred = predict(gbm_2,TestData,1000)    
    cor(pred,TestData$medv,method="spearman")
    [1] 0.8676821
    mean(abs(TestData$medv-pred))
    [1] 5.47673
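
The same scaling behaviour can be checked without gbm at all. The sketch below uses simulated data and lm() as a stand-in for the boosted model (both are assumptions for illustration, chosen because a linear fit rescales exactly when the response is rescaled); `fit_metrics` is a hypothetical helper.

```r
set.seed(1)

# Simulated data; lm() stands in for the boosted model
n <- 200
x <- runif(n)
y <- 3 * x + rnorm(n)
train <- data.frame(x = x, y = y)

# Fit a model and return MSE, MAE and the fitted/observed correlation
fit_metrics <- function(d) {
  fit <- lm(y ~ x, data = d)
  res <- residuals(fit)
  c(mse = mean(res^2),
    mae = mean(abs(res)),
    cor = cor(fitted(fit), d$y))
}

m1 <- fit_metrics(train)                       # original response
m2 <- fit_metrics(transform(train, y = 2 * y)) # response doubled

# Ratios after doubling: MSE x4, MAE x2, correlation unchanged
ratios <- m2 / m1
print(round(ratios, 6))
```

With a boosted model the ratios will not be exact, because each fit involves random subsampling, but the pattern is the same as in the gbm_0 and gbm_2 outputs above.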