Boosted regression trees - deviance values

I am fitting a BRT model using the gbm package in R, with the following formula:

height above ground ~ Age + season + habitat + timeofday

The height above ground is a continuous variable, and so is timeofday. Season and habitat are binary (two-level) variables.

I get a very high deviance and I don't know why... Can somebody help me with the parameters?

> M1 <- gbm.step(data=data, gbm.x = 2:5, gbm.y = 1,
+                family = "gaussian", tree.complexity = 4,
+                learning.rate = 0.01, bag.fraction = 0.50,
+                tolerance.method = "fixed",
+                tolerance = 0.01)


 GBM STEP - version 2.9 

Performing cross-validation optimisation of a boosted regression tree model 
for HAG and using a family of gaussian 
Using 15439 observations and 4 predictors 
creating 10 initial models of 50 trees 

 folds are unstratified 
total mean deviance =  55368.22 
tolerance is fixed at  0.01 
ntrees resid. dev. 
50    51050.65 
now adding trees... 
100   48935.65 
150   47805.14 
200   47193.43 
250   46841.71 
300   46631.33 
350   46498.56 
400   46418.58 
450   46371.7 
500   46336.54 
550   46317.53 
600   46309.25 
650   46300.57 
700   46296.82 
750   46297 
800   46299.11 
850   46297.7 
900   46298.34 
950   46292.32 
1000   46297.62 
1050   46295.78 
1100   46301.32 
1150   46306.59 
1200   46312.55 
1250   46314.67 
1300   46318.64 
1350   46321.38 
1400   46324.33 
1450   46322.9 
fitting final gbm model with a fixed number of 950 trees for HAG

mean total deviance = 55368.21 
mean residual deviance = 45913.34 

estimated cv deviance = 46292.32 ; se = 1366.501 

training data correlation = 0.413 
cv correlation =  0.406 ; se = 0.008 

elapsed time -  0.02 minutes

Solution

With family = "gaussian", the deviance reported by gbm.step is the mean squared error, so its size depends on the scale of your dependent variable. A large value does not by itself mean the model is wrong.

For example:

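library(dismo)    # provides gbm.step()
library(gbm)      # the boosting engine that gbm.step() calls
library(mlbench)  # provides the BostonHousing data
data(BostonHousing)
# random split: 400 rows for training, the rest for testing
# (no seed is set, so your numbers will differ slightly from those below)
idx = sample(nrow(BostonHousing), 400)
TrnData = BostonHousing[idx,]
TestData = BostonHousing[-idx,]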

The dependent variable is the last column, "medv", so we run a gbm on the raw data:

gbm_0 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")

mean total deviance = 84.02 
mean residual deviance = 7.871 

estimated cv deviance = 13.959 ; se = 1.909 

training data correlation = 0.952 
cv correlation =  0.916 ; se = 0.012

The mean residual deviance can also be calculated directly from the residuals (y minus predicted y):

mean(gbm_0$residuals^2)
[1] 7.871158
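
To make the two deviance figures concrete, here is a minimal base-R sketch on simulated data, with lm standing in for the boosted model (an assumption made only to keep the example self-contained; all variable names are made up). For a gaussian family, the total deviance should correspond to the mean squared error of predicting the overall mean for every row, and the residual deviance to the mean squared error of the fitted values:

```r
set.seed(1)                          # simulated data, not from the question
n <- 500
x <- runif(n)
y <- 10 + 5 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)                     # stand-in for the gbm fit (assumption)

# MSE when every row is predicted by the overall mean
total_dev <- mean((y - mean(y))^2)
# MSE of the fitted model
resid_dev <- mean(residuals(fit)^2)

# share of the total deviance the model accounts for
explained <- 1 - resid_dev / total_dev
c(total = total_dev, residual = resid_dev, explained = explained)
```

The ratio in the last step does not depend on the units of y, which makes it easier to interpret than either deviance on its own.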

It is better to evaluate the model on TestData, which it has not been trained on. You can check how close the predictions are to the observed values using either a correlation or the MAE (mean absolute error):

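# predict with 1000 trees; gbm_0$gbm.call$best.trees holds the number selected by gbm.step
pred = predict(gbm_0, TestData, n.trees = 1000)
# Spearman rank correlation; use method="pearson" if you prefer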
cor(pred,TestData$medv,method="spearman")
[1] 0.8652737
# MAE
mean(abs(TestData$medv-pred))
[1] 2.75325

It also helps to plot the predictions against the observed values. Here the correlation is good, and the predictions are off by about 3 units of medv on average.
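
If you want these hold-out checks in one place, a small helper along the following lines may be convenient. This is only a sketch: holdout_metrics is a hypothetical name, not a function from gbm or dismo, and the toy vectors are for illustration only. It also reports the RMSE, which is in the same units as the response and is directly comparable to the square root of the deviance:

```r
# hypothetical helper: summarise hold-out accuracy for a numeric response
holdout_metrics <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  err <- obs - pred
  c(rmse     = sqrt(mean(err^2)),    # same units as the response
    mae      = mean(abs(err)),
    pearson  = cor(obs, pred),
    spearman = cor(obs, pred, method = "spearman"))
}

# toy values, for illustration only
obs  <- c(10, 12, 15, 20, 26)
pred <- c(11, 11, 16, 18, 27)
holdout_metrics(obs, pred)
```

With the objects from the example above, the call would be holdout_metrics(TestData$medv, pred).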

Now if you change the scale of your dependent variable, the deviance changes (doubling y roughly quadruples it), but the correlation is unchanged and the MAE simply doubles along with y, so the interpretation stays the same:

TrnData$medv = TrnData$medv*2
TestData$medv = TestData$medv*2
gbm_2 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")

mean total deviance = 336.081 
mean residual deviance = 30.983 

estimated cv deviance = 57.52 ; se = 10.254 

training data correlation = 0.953 
cv correlation =  0.911 ; se = 0.019 

elapsed time -  0.2 minutes

pred = predict(gbm_2,TestData,1000)    
cor(pred,TestData$medv,method="spearman")
[1] 0.8676821
mean(abs(TestData$medv-pred))
[1] 5.47673
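
Coming back to the output in the question, the same reasoning can be applied to the printed figures. A rough sketch, using only the numbers from that log (the variable names are mine) and assuming the gaussian deviance is a mean squared error as described above:

```r
total_dev <- 55368.21   # mean total deviance
resid_dev <- 45913.34   # mean residual deviance (training data)
cv_dev    <- 46292.32   # estimated cv deviance

sqrt(total_dev)              # about 235: spread of HAG around its mean, in HAG units
sqrt(cv_dev)                 # about 215: typical cross-validated error, in HAG units
1 - resid_dev / total_dev    # about 0.17 of the deviance explained on training data
1 - cv_dev / total_dev       # about 0.16 under cross-validation
```

So the deviance looks large mainly because HAG varies by a couple of hundred units around its mean. The more informative point is that the four predictors explain roughly 16 to 17% of that variation, which agrees with the reported correlations of about 0.41 (0.41^2 is about 0.17).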