I have an old linear model which I wish to improve using XGBoost. I have the predictions from the old model, which I wish to use as a base margin. Also, due to the nature of what I'm modeling, I need to use weights. My old GLM is a Poisson regression with the formula number_of_defaults/exposure ~ param_1 + param_2 and with the weights set to exposure (the same as the denominator of the response variable). When training the new XGBoost model on the data, I do this:
xgb_model = xgb.XGBRegressor(n_estimators=25,
                             max_depth=100,
                             max_leaves=100,
                             learning_rate=0.01,
                             n_jobs=4,
                             eval_metric="poisson-nloglik")
model = xgb_model.fit(X=X_train, y=y_train, sample_weight=_WEIGHT, base_margin=_BASE_MARGIN)
where _WEIGHT and _BASE_MARGIN are the weights and the old model's predictions (both popped out of X_train).
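To make this concrete, here is a minimal sketch of how the two arrays could be built (the column names, like old_glm_prediction, are made up for the illustration, and it assumes that for a Poisson-style objective the base margin should be on the log/link scale rather than the response scale):
import numpy as np
import pandas as pd

# Toy stand-in for the real training frame; the column names "exposure" and
# "old_glm_prediction" are made up for this illustration
X_train = pd.DataFrame({
    "param_1": [0.1, 0.3, 0.2, 0.5],
    "param_2": [1.0, 0.8, 1.2, 0.9],
    "exposure": [1.0, 0.5, 2.0, 1.5],
    "old_glm_prediction": [0.02, 0.05, 0.03, 0.04],  # old GLM's predicted rate
})

# pop the weights out of the feature frame, as in the fit above
_WEIGHT = X_train.pop("exposure").to_numpy()

# base_margin is added on the raw (link) scale, so for a Poisson-style
# objective the old model's predicted rate is logged first
_BASE_MARGIN = np.log(X_train.pop("old_glm_prediction").to_numpy())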
But how do I do cross-validation or out-of-sample analysis when I need to specify the weights and the base margin?
As far as I can see, I could use sklearn and GridSearchCV, but then I would need to specify the weights and the base margin in XGBRegressor() (instead of in fit() as above). The closest equivalent of base_margin in XGBRegressor() is the base_score argument, but there is no argument for the weights at all.
Also, I could potentially forget about cross-validation and just use a training and a test dataset; I would then use the eval_set argument of fit(), but that way there is no means of specifying which values are the weights and which are the base margin in the different sets.
Any guidance in the right direction is much appreciated!
You can use cross_val_predict with the fit_params argument, or GridSearchCV.fit with **fit_params.
Here is a working proof of concept:
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import cross_val_predict, GridSearchCV
import numpy as np
# Sample dataset
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
xgb_model = xgb.XGBRegressor(n_estimators=5)
fit_params = dict(sample_weight=np.abs(X[:, 0]), base_margin=np.abs(X[:, 1]))
# Simple fit
xgb_model.fit(X, y, **fit_params)
# cross_val_predict
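# note: in scikit-learn 1.4+ the fit_params argument is deprecated in favor of params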
y_pred = cross_val_predict(xgb_model, X, y, cv=3, fit_params=fit_params)
print(y_pred.shape, y.shape)
# grid search
grid = GridSearchCV(xgb_model, param_grid={"n_estimators": [5, 10, 15]})
grid.fit(X, y, **fit_params)
You can see what happens in the source code: here, here and here. The last link is where fit_params gets indexed according to the cross-validation splits.
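As for the train/test alternative you mention: in reasonably recent xgboost versions the sklearn fit() also accepts sample_weight_eval_set and base_margin_eval_set (one array per eval_set entry), so, if I'm not mistaken, you can pass weights and a base margin for the evaluation data as well. A rough sketch along the same lines as the example above:
import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split

# same toy data, weights and base margin as above
diabetes = datasets.load_diabetes()
X, y = diabetes.data[:150], diabetes.target[:150]
w, bm = np.abs(X[:, 0]), np.abs(X[:, 1])

# split features, target, weights and base margin consistently
X_tr, X_te, y_tr, y_te, w_tr, w_te, bm_tr, bm_te = train_test_split(
    X, y, w, bm, test_size=0.3, random_state=0)

model = xgb.XGBRegressor(n_estimators=5)
model.fit(X_tr, y_tr,
          sample_weight=w_tr,
          base_margin=bm_tr,
          eval_set=[(X_te, y_te)],
          sample_weight_eval_set=[w_te],   # one array per eval_set entry
          base_margin_eval_set=[bm_te])
print(model.evals_result())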