Tags: python-3.x, xgboost, boosting

Why does xgboost produce the same predictions and nan values for features when using entire dataset?


Summary

I am using Python v3.7 and xgboost v0.81. I have continuous data (y) at a US state level for each week from 2015 to 2019. I'm trying to regress y on the following features: year, month, week, region (encoded). The training set is August 2018 and earlier; the test set is September 2018 onward. When I train the model this way, two strange things happen:

  • feature_importances are all nan
  • predictions are all the same (0.5, 0.5, ...)

What I've tried

Fixing any one of the features to a single value allows the model to train properly, and both of the issues above disappear, e.g. filtering to year == 2017 or region_encoded == 28.

Code

(I know this is a temporal problem but this general case exhibits the problem as well)

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X = df[['year', 'month', 'week', 'region_encoded']]
display(X)
y = df.target
display(y)
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.1)

model = XGBRegressor(n_jobs=-1, n_estimators=1000).fit(X_train, y_train)

display(model.predict(X_test)[:20])

display(model.feature_importances_)
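A quick sanity check worth running before training is to count NaNs in the target. The frame below is a made-up stand-in mirroring the question's layout (`df` and `target` are the names from the question); on the real data the check is simply `df['target'].isna().sum()`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same columns as the question's df.
df = pd.DataFrame({
    'year': [2015] * 5,
    'month': [10] * 5,
    'week': [40] * 5,
    'region_encoded': range(5),
    'target': [272.0, 10.0, np.nan, 46.0, 558.0],
})

# Count missing values in the target column.
n_missing = df['target'].isna().sum()
print(n_missing)  # 1 -> at least one NaN in the target
```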

Results - some of the predictions and the feature importances

year    month   week    region_encoded
0   2015    10  40  0
1   2015    10  40  1
2   2015    10  40  2
3   2015    10  40  3
4   2015    10  40  4

0    272.0
1     10.0
2    290.0
3     46.0
4    558.0
Name: target, dtype: float64

array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], dtype=float32)

array([nan, nan, nan, nan], dtype=float32)

Solution

  • If the target variable contains even a single NaN, that is enough to break many machine learning algorithms. An unhandled NaN in the target propagates through the update step of many ML algorithms (for example, when computing derivatives or gradients), poisoning every subsequent calculation. I cannot say exactly which step inside XGBoost is affected, but the symptoms match.

    For example, consider the analytical solution for linear regression:

    import numpy as np
    import numpy.linalg as la
    from scipy import stats
    
    y = np.array([0, 1, 2, 3, np.nan, 5, 6, 7, 8, 9])
    x = stats.norm().rvs((len(y), 3))
    
    # Ordinary least-squares estimate: the single NaN in y propagates
    # through x.T @ y into every coefficient.
    m_hat = la.inv(x.T @ x) @ x.T @ y
    print(m_hat)  # [nan nan nan]
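One fix, continuing the same sketch, is to mask out rows whose target is NaN before solving; the pandas equivalent for the question's data would be `df.dropna(subset=['target'])`:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0, 1, 2, 3, np.nan, 5, 6, 7, 8, 9], dtype=float)
x = rng.normal(size=(len(y), 3))

# Keep only the rows with a finite target before fitting.
mask = ~np.isnan(y)
m_hat = np.linalg.inv(x[mask].T @ x[mask]) @ x[mask].T @ y[mask]
print(m_hat)  # three finite coefficients, no NaN
```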