python xgboost cross-validation weighted

XGBRegressor with weights and base_margin: out of sample validation possible?

I have an old linear model that I want to improve using XGBoost. I have the predictions from the old model and want to use them as a base margin. Because of the nature of what I'm modeling, I also need to use weights. The old GLM is a Poisson regression with formula number_of_defaults/exposure ~ param_1 + param_2 and weights set to exposure (the same as the denominator in the response variable). When training the new XGBoost model, I do this:

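import xgboost as xgb

# n_estimators sets the number of boosting rounds in the scikit-learn wrapper;
# nrounds is the R package's name for it and is not used here, so it is dropped.
xgb_model = xgb.XGBRegressor(n_estimators=25,
                             max_depth=100,
                             max_leaves=100,
                             learning_rate=0.01,
                             n_jobs=4,
                             eval_metric="poisson-nloglik")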

model = xgb_model.fit(X=X_train, y=y_train, sample_weight=_WEIGHT, base_margin=_BASE_MARGIN)

Here _WEIGHT and _BASE_MARGIN are the weights and the old model's predictions (popped out of X_train). But how do I do cross-validation or out-of-sample analysis when I need to specify weights and a base margin?

As far as I can see, I could use sklearn's GridSearchCV, but then I would need to specify the weights and base margin in XGBRegressor() (instead of in fit() as above). The closest constructor argument to base_margin seems to be base_score, and there is no constructor argument for weights.

Alternatively, I could skip cross-validation and just use a training and a test dataset together with the eval_set argument, but then I see no way of specifying the weights and the base margin for each of the sets.

Any guidance in the right direction is much appreciated!

Solution

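You can use cross_val_predict with the fit_params argument, or GridSearchCV.fit with **fit_params. (In scikit-learn 1.4 and later, the fit_params argument of cross_val_predict is deprecated in favor of params.)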

Here is a working proof of concept:

import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import cross_val_predict, GridSearchCV
import numpy as np

# Sample dataset
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]

xgb_model = xgb.XGBRegressor(n_estimators=5)
fit_params = dict(sample_weight=np.abs(X[:, 0]), base_margin=np.abs(X[:, 1]))

# Simple fit
xgb_model.fit(X, y, **fit_params)

# cross_val_predict
y_pred = cross_val_predict(xgb_model, X, y, cv=3, fit_params=fit_params)
print(y_pred.shape, y.shape)

# grid search
grid = GridSearchCV(xgb_model, param_grid={"n_estimators": [5, 10, 15]})
grid.fit(X, y, **fit_params)

You can see what happens in the source code: here, here and here. The last link is where fit_params get indexed according to the cross-validation splits.