I've recently come across a "strange" observation in my dataset. After fitting an XGB model with 20 features, I plotted the top 10 features with the highest gain values. The result is shown below:
F1 140027.061202
F2 11242.470370
F3 9957.161039
F4 9677.070632
F5 7103.275865
F6 4691.814929
F7 4030.730915
F8 2775.235616
F9 2384.573760
F10 2328.680871
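(A gain listing like this can be read off the fitted model; below is a minimal sketch, assuming the model alg defined further down and a reasonably recent xgboost. Note that get_score's 'gain' is the average gain per split, while 'total_gain' is also available in newer versions.)

# Sketch: extract gain-based importances from the fitted model `alg`
gains = alg.get_booster().get_score(importance_type='gain')
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(name, gain)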
As you can see, F1 dominates all other features in gain (roughly 12x the gain of F2). I verified the results on the test set: the model is not overfitting and gives decent results (compared to my figures of merit):
F1-score: 0.739812237993
Accuracy: 0.839632893701
Precision: 0.63759578607
Recall: 0.881059718486
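(For completeness, a sketch of how such a verification can be computed with scikit-learn, assuming the fitted model alg defined below, test data X_test/y_test, and a 0.5 probability threshold:)

from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

y_prob = alg.predict(X_test)         # binary:logistic -> predicted probabilities
y_pred = (y_prob > 0.5).astype(int)  # hard labels at a 0.5 threshold
print('F1-score: %s' % f1_score(y_test, y_pred))
print('Accuracy: %s' % accuracy_score(y_test, y_pred))
print('Precision: %s' % precision_score(y_test, y_pred))
print('Recall: %s' % recall_score(y_test, y_pred))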
Based on these results, is it correct to conclude that feature F1 alone is enough for building a model?
To prove this, I re-ran the modeling with the same parameters, but this time with F1 as a standalone feature. The results are only slightly worse than before (and again no overfitting):
F1-score: 0.710906846703
Accuracy: 0.819880412472
Precision: 0.607953806173
Recall: 0.85583736242
My XGB parameters are super simple in both cases:
from xgboost import XGBRegressor

alg = XGBRegressor(
    n_estimators=200,
    max_depth=5,
    objective='binary:logistic',  # logistic output despite the regressor wrapper
    seed=27,
)
# Fit the algorithm on the data
metric = 'map'
alg.fit(X_train, y_train, eval_metric=metric)
After I exclude feature F1 and re-fit the model, I get similar verification metrics (slightly worse), but in that case feature F3 becomes "dominant" with a really high gain (~10000), while feature F2 is the next one, also with a gain of ~10000.
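(For reference, both follow-up experiments amount to column selection on the same pipeline; a minimal sketch, assuming X_train is a pandas DataFrame:)

# 1) F1 as a standalone feature
alg.fit(X_train[['F1']], y_train, eval_metric=metric)

# 2) all features except F1
alg.fit(X_train.drop('F1', axis=1), y_train, eval_metric=metric)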
Thanks!
Have you tried adding and tuning additional parameters, and using grid search to find the optimal combination? To prevent overfitting I can suggest adding and tuning regularization-related parameters such as learning_rate, min_child_weight, gamma, subsample, colsample_bytree, reg_alpha, and reg_lambda.
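A minimal sketch of such a grid search, assuming scikit-learn's GridSearchCV; the grid values below are placeholders, not recommendations:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative parameter grid -- adjust ranges to your data
param_grid = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1, 5, 10],
    'subsample': [0.7, 0.85, 1.0],
    'colsample_bytree': [0.7, 0.85, 1.0],
}
search = GridSearchCV(
    XGBClassifier(n_estimators=200, objective='binary:logistic', seed=27),
    param_grid,
    scoring='f1',
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)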
Since you are using XGBRegressor with a binary:logistic objective, try modifying the objective function (or switch to XGBClassifier, which is designed for classification). I can also suggest monitoring the validation and training loss while the trees are being built.
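A minimal sketch of such monitoring with early stopping, assuming a held-out split X_valid/y_valid exists; the fit-time eval_set/early_stopping_rounds arguments match older xgboost versions (newer versions move some of these to the constructor):

from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=200, max_depth=5,
                    objective='binary:logistic', seed=27)
clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],  # watch both losses
    eval_metric='logloss',
    early_stopping_rounds=20,  # stop once validation loss stops improving
    verbose=True,
)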