Tags: python, xgboost, cross-validation, weighted

XGBRegressor with weights and base_margin: out of sample validation possible?


I have an old linear model which I wish to improve using XGBoost. I have the predictions from the old model, which I wish to use as a base margin. Also, due to the nature of what I'm modeling, I need to use weights. My old GLM is a Poisson regression with the formula number_of_defaults/exposure ~ param_1 + param_2 and weights set to exposure (the same as the denominator in the response variable). When training the new XGBoost model on the data, I do this:

import xgboost as xgb

# n_estimators is the number of boosting rounds (called nrounds in the R interface)
xgb_model = xgb.XGBRegressor(n_estimators=25,
                             max_depth=100,
                             max_leaves=100,
                             learning_rate=0.01,
                             n_jobs=4,
                             eval_metric="poisson-nloglik")

model = xgb_model.fit(X=X_train, y=y_train, sample_weight=_WEIGHT, base_margin=_BASE_MARGIN)

Here, _WEIGHT and _BASE_MARGIN are the weights and the old model's predictions (popped out of X_train). But how do I do cross-validation or out-of-sample analysis when I need to specify weights and a base margin?

As far as I can see, I can use sklearn and GridSearchCV, but then I would need to specify the weights and base margin in XGBRegressor() (instead of in fit() as above). The equivalent of base_margin in XGBRegressor() is the argument base_score, but there is no argument for the weights.

Alternatively, I could forget about cross-validation and just use a training and a test dataset with the eval_set argument, but as far as I can tell there is then no way of specifying which weights and base margins belong to each set.

Any guidance in the right direction is much appreciated!


Solution

  • You can use cross_val_predict with the fit_params argument, or GridSearchCV.fit with **fit_params.

    Here is a working proof of concept:

    import xgboost as xgb
    from sklearn import datasets
    from sklearn.model_selection import cross_val_predict, GridSearchCV
    import numpy as np
    
    # Sample dataset
    diabetes = datasets.load_diabetes()
    X = diabetes.data[:150]
    y = diabetes.target[:150]
    
    xgb_model = xgb.XGBRegressor(n_estimators=5)
    fit_params = dict(sample_weight=np.abs(X[:, 0]), base_margin=np.abs(X[:, 1]))
    
    # Simple fit
    xgb_model.fit(X, y, **fit_params)
    
    # cross_val_predict (from scikit-learn 1.4 on, this keyword is called `params`;
    # `fit_params` was removed in 1.6)
    y_pred = cross_val_predict(xgb_model, X, y, cv=3, fit_params=fit_params)
    print(y_pred.shape, y.shape)
    
    # grid search
    grid = GridSearchCV(xgb_model, param_grid={"n_estimators": [5, 10, 15]})
    grid.fit(X, y, **fit_params)
    

    You can see what happens in the source code: here, here and here. The last link is where the fit_params are indexed according to the cross-validation splits.