Normalization, e.g. z-scoring, is a common preprocessing step in machine learning. I am analyzing a dataset and use ensemble methods like Random Forests and the XGBoost framework.
Now I compare models trained with and without this normalization.
Using cross-validation I observe in both cases that the training error
decreases as the max_depth parameter increases.
In the first case the test error also decreases and saturates at a certain MAE.
For the z-scored features, however, the test error does not decrease at all.
In this question: https://datascience.stackexchange.com/questions/16225/would-you-recommend-feature-normalization-when-using-boosting-trees it was discussed that normalization is not necessary for tree-based methods. But the example above shows that it has a severe effect.
So I have two questions regarding this:
Thanks!
It is not easy to see what is going on in the absence of any code or data.
Normalisation may or may not be helpful depending on the particular data and how the normalisation step is applied. Tree-based methods ought to be robust enough to handle the raw data. In your cross-validation, is your code doing the normalisation separately for each fold? Performing a single normalisation prior to cross-validation may lead to significant data leakage.
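To make the distinction concrete, here is a minimal sketch (on hypothetical synthetic data, since your actual dataset isn't shown) contrasting per-fold normalisation via a scikit-learn `Pipeline` with the leaky variant that scales the full dataset before cross-validation. The `Pipeline` re-fits the `StandardScaler` on each training fold, so the held-out fold never influences the scaling statistics:

```python
# Sketch: per-fold vs. pre-CV normalisation, on synthetic data
# (your real X, y would go here instead).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Correct: the scaler is fitted inside each CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=5,
                         scoring="neg_mean_absolute_error")

# Leaky: the scaler sees all folds (including future test folds)
# before cross-validation starts
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_leaky, y, cv=5, scoring="neg_mean_absolute_error",
)
```

If your z-scored results come from something like the second pattern, that alone could explain a qualitative difference between the two experiments.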
With very high values of max_depth you will have a much more complex model that fits the training data well but fails to generalise to new data. I tend to prefer max depths from 2 to 5. If I can't get a reasonable model, I turn my efforts to feature engineering rather than tweaking the hyperparameters too much.
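A quick sketch of this overfitting pattern, again on made-up synthetic data: sweeping max_depth for a single regression tree, the training MAE keeps shrinking while the test MAE stops improving (or worsens) past a modest depth:

```python
# Sketch: train vs. test MAE as max_depth grows, on a noisy
# synthetic sine-wave regression problem (hypothetical data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

maes = {}
for depth in [2, 3, 5, 10, 20]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    maes[depth] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),  # train MAE
        mean_absolute_error(y_te, tree.predict(X_te)),  # test MAE
    )
    print(f"max_depth={depth:2d}  "
          f"train MAE={maes[depth][0]:.3f}  test MAE={maes[depth][1]:.3f}")
```

The training MAE is non-increasing in depth (deeper trees only refine the same greedy splits), which is exactly the behaviour you observed on the training side of both experiments.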