python, machine-learning, time-series, regression, xgboost

XGBoost XGBRegressor predict with different dimensions than fit


I am using the xgboost XGBRegressor to train on a dataset with 20 dimensions (19 input features plus the target):

    import xgboost as xgb

    model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
    model.fit(trainX, trainy, verbose=False)

trainX is 2000 x 19, and trainy is 2000 x 1.

In other words, during training I am using the 19 dimensions of trainX to predict the 20th dimension (the single dimension of trainy).

When I am making a prediction:

    yhat = model.predict(x_input)

x_input has to have 19 dimensions. I am wondering if there is a way to keep using the 19 dimensions for training, but at prediction time supply an x_input with only 4 of those dimensions to predict the 20th. It is kind of like transfer learning to a different input dimension.

Does xgboost support such a feature? I tried filling x_input's other dimensions with None, but that yields terrible prediction results.


Solution

  • Fundamentally, you're training your model with a dense dataset (19/19 feature values), and are now wondering if you're allowed to make predictions with a sparse dataset (4/19 feature values).

    Does xgboost support such a feature?

    Yes, it is technically possible with XGBoost, because XGBoost treats the absent 15/19 feature values as missing and routes them down each tree's default branch. It would not be possible with some other ML frameworks (such as Scikit-Learn estimators) that do not accept missing input by default.

    Alternatively, you can make your XGBoost model explicitly "missing-value-proof" by assembling a pipeline that contains feature imputation step(s).

    I tried filling x_input's other dimensions with None, but that yields terrible prediction results.

    You should represent missing values as float("NaN") (not as None).