I am fitting a BRT model with gbm.step (from the dismo package, which wraps gbm) in R for the following model:
height above ground ~ Age + season + habitat + timeofday
Height above ground is a continuous variable, and so is timeofday; season and habitat are binary categorical variables.
I get a very high deviance and I don't know why. Can somebody help me with the parameters?
> M1 <- gbm.step(data=data, gbm.x = 2:5, gbm.y = 1,
+ family = "gaussian", tree.complexity = 4,
+ learning.rate = 0.01, bag.fraction = 0.50,
+ tolerance.method = "fixed",
+ tolerance = 0.01)
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for HAG and using a family of gaussian
Using 15439 observations and 4 predictors
creating 10 initial models of 50 trees
folds are unstratified
total mean deviance = 55368.22
tolerance is fixed at 0.01
ntrees resid. dev.
50 51050.65
now adding trees...
100 48935.65
150 47805.14
200 47193.43
250 46841.71
300 46631.33
350 46498.56
400 46418.58
450 46371.7
500 46336.54
550 46317.53
600 46309.25
650 46300.57
700 46296.82
750 46297
800 46299.11
850 46297.7
900 46298.34
950 46292.32
1000 46297.62
1050 46295.78
1100 46301.32
1150 46306.59
1200 46312.55
1250 46314.67
1300 46318.64
1350 46321.38
1400 46324.33
1450 46322.9
fitting final gbm model with a fixed number of 950 trees for HAG
mean total deviance = 55368.21
mean residual deviance = 45913.34
estimated cv deviance = 46292.32 ; se = 1366.501
training data correlation = 0.413
cv correlation = 0.406 ; se = 0.008
elapsed time - 0.02 minutes
For a Gaussian model, the deviance reported by gbm is the mean squared error, so it depends on the scale of your dependent variable.
For example:
library(dismo)
library(mlbench)
data(BostonHousing)
idx=sample(nrow(BostonHousing),400)
TrnData = BostonHousing[idx,]
TestData = BostonHousing[-idx,]
The dependent variable is the last column, "medv", so we run a gbm on the raw data:
gbm_0 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
mean total deviance = 84.02
mean residual deviance = 7.871
estimated cv deviance = 13.959 ; se = 1.909
training data correlation = 0.952
cv correlation = 0.916 ; se = 0.012
You can see that the mean residual deviance can also be calculated from your residuals (y minus predicted y):
mean(gbm_0$residuals^2)
[1] 7.871158
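Since the Gaussian deviance is an MSE, its square root (the RMSE) puts the error back in the units of the response, which makes a large-looking deviance easier to judge. A minimal base-R sketch with simulated data (not the Boston model):

```r
set.seed(7)
y    <- rnorm(50, mean = 20, sd = 5)   # simulated response
yhat <- y + rnorm(50, sd = 2)          # pretend model predictions
dev  <- mean((y - yhat)^2)             # gaussian deviance = MSE
sqrt(dev)                              # RMSE: typical error in units of y
```

In the question above, sqrt(46292) is about 215, so the model is typically off by roughly 215 units of height above ground; whether that is "very high" depends on the range of the response.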
It is always good to evaluate on the test data (which the model has not been trained on). You can check how close the predictions are to the actual data using either correlation or the MAE (mean absolute error):
pred = predict(gbm_0,TestData,1000)
# or pearson if you like
cor(pred,TestData$medv,method="spearman")
[1] 0.8652737
# MAE
mean(abs(TestData$medv-pred))
[1] 2.75325
If you visualize it, the good correlation makes sense: your predictions are on average off by about 3.
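A quick way to see this is a predicted-vs-observed scatterplot with a 1:1 line; a minimal sketch with simulated stand-in data (replace `obs` and `pred` with `TestData$medv` and your model's predictions):

```r
set.seed(1)
obs  <- rnorm(100, mean = 22, sd = 9)  # stand-in for TestData$medv
pred <- obs + rnorm(100, sd = 3)       # stand-in for model predictions
plot(obs, pred, xlab = "Observed", ylab = "Predicted")
abline(0, 1, lty = 2)                  # points on this line are perfect predictions
```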
Now if you change the scale of your dependent variable, the deviance changes, but your interpretation from correlation or MAE will stay the same:
TrnData$medv = TrnData$medv*2
TestData$medv = TestData$medv*2
gbm_2 = gbm.step(data=TrnData,gbm.x=1:13,gbm.y=14,family="gaussian")
mean total deviance = 336.081
mean residual deviance = 30.983
estimated cv deviance = 57.52 ; se = 10.254
training data correlation = 0.953
cv correlation = 0.911 ; se = 0.019
elapsed time - 0.2 minutes
pred = predict(gbm_2,TestData,1000)
cor(pred,TestData$medv,method="spearman")
[1] 0.8676821
mean(abs(TestData$medv-pred))
[1] 5.47673
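Note the Spearman correlation is essentially unchanged while the MAE doubled along with the scale; it is the MAE relative to the mean of the response that stays constant. A toy base-R check of that scale behaviour (simulated data, for illustration only):

```r
set.seed(42)
obs  <- runif(200, 10, 50)        # simulated response
pred <- obs + rnorm(200, sd = 4)  # simulated predictions
# rank correlation is invariant to rescaling y
all.equal(cor(pred, obs, method = "spearman"),
          cor(2 * pred, 2 * obs, method = "spearman"))  # TRUE
# relative MAE is also invariant, although the raw MAE doubles
all.equal(mean(abs(obs - pred)) / mean(obs),
          mean(abs(2 * obs - 2 * pred)) / mean(2 * obs))  # TRUE
```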