Tags: python, xgboost, shap

SHAP Values Aren't The Same As XGBoost Model's Predictions


Let me preface this by saying that in the past two days I have taught myself, vaguely, how to use this library, so it is entirely possible I am making an incredibly simple mistake, but any help is greatly appreciated. I am trying to use SHAP's waterfall plot to visualize the impact of various variables on the prediction of my XGBoost model. The model takes in 13 variables about a team's salary and predicts the team's rank. The model works well, but when I use SHAP the values look wrong. As far as I understand, the f(x) in the top right of the waterfall is supposed to equal the model's prediction, but that is not at all the case.
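(For reference, the property I mean is SHAP's additivity: the explainer's base value plus the per-feature contributions should reproduce f(x) for that row. A minimal sketch with made-up numbers, just to show the arithmetic:)

```python
# SHAP's additivity property: f(x) = base_value + sum(shap_values).
# The numbers below are hypothetical, standing in for one row's explanation.
base_value = 10.2                      # the explainer's expected value E[f(X)]
shap_values = [1.3, -0.7, 4.1, -0.4]   # one contribution per feature

f_x = base_value + sum(shap_values)    # this is the f(x) a waterfall displays
print(f_x)                             # 14.5 -- should match the model's raw output
```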

Here is my code:

import shap
from joblib import dump, load
import xgboost as xgb
import pandas as pd
import numpy as np

# Raw string so the backslashes in the Windows path are not treated as escapes
filen = r"D:\miniconda-keep\Created Data\Done Data - Copy.csv"
df = pd.read_csv(filen)
X = df.iloc[:, 3:-3].div(10000).astype(int)
y = df.iloc[:, -1:].astype(int).subtract(1)

model = load(r"D:\miniconda-keep\Saved Will Made Files\Models\Successful_XGBoost_Model.joblib")


explainer = shap.Explainer(model)
shap_values = explainer(X)
pred = model.predict(X)

# Edit the variable below to look at different teams
to_pred = 21

print(X.iloc[to_pred].subtract(X.mean(axis=0)))

print('Team:',pd.read_csv(filen).iloc[to_pred,1])
print(f"model pred {pred[to_pred]+1}")

shap.plots.waterfall(shap_values[to_pred,:,pred[to_pred]])

This is the output and waterfall plot:

Average Salary                 73.311005
Highest Salary               1280.382775
Number of Homegrowns            0.000000
Salary IQR                     25.593301
Salary Standard Deviation     239.521531
Average GK Salary               5.866029
Average Defender Salary        10.866029
Average Midfielder Salary     118.449761
Average Attacker Salary       137.674641
Highest Goalkeeper Salary      32.688995
dtype: float64
Team: Toronto FC
model pred 15

SHAP Waterfall Output

A working example from SHAP's API Examples page:

import xgboost

import shap

# train XGBoost model
X, y = shap.datasets.adult()
model = xgboost.XGBClassifier().fit(X, y)

# compute SHAP values
explainer = shap.Explainer(model, X)
shap_values = explainer(X)

shap.plots.waterfall(shap_values[0])

This outputs:

SHAP Example Successful Waterfall

Thank you so much for any help!


Solution

  • The SHAP output was the log odds of the model making that prediction, as @MichaelM pointed out. This was because the model was an XGBoost classifier, not a regressor. With an XGBoost regressor, the waterfall's f(x) is in fact the model's prediction.

    With an XGBoost regressor:

    import xgboost
    import shap
    import pandas as pd
    from sklearn.model_selection import RepeatedKFold
    from sklearn.model_selection import cross_val_score
    from numpy import absolute
    
    
    # Raw string so the backslashes in the Windows path are not treated as escapes
    filen = r"D:\miniconda-keep\Created Data\Done Data - Copy.csv"
    df = pd.read_csv(filen)
    X, y = df.iloc[:, 3:-3].div(10000), df.iloc[:, -1:]
    
    model = xgboost.XGBRegressor()
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = absolute(scores)
    print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )
    model.fit(X, y)
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    
    # Edit the variable below to change the team
    row = 2
    # visualize the first prediction's explanation
    print(pd.read_csv(filen).iloc[row,1])
    print(pd.read_csv(filen).iloc[row,-1])
    shap.plots.waterfall(shap_values[row])
    

    This outputs:

    Mean MAE: 3.348 (0.495)
    FC Cincinnati
    1.0
    

    SHAP Waterfall from XGBoost Regression Model
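If you want to keep the classifier instead, the f(x) from its waterfall can still be interpreted: it is a log-odds margin, and the sigmoid converts it back to a probability. A small standalone sketch (plain Python, hypothetical margin value):

```python
import math

def logodds_to_prob(logodds):
    """Convert a log-odds margin (what SHAP shows for an XGBoost
    classifier) back to a probability via the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-logodds))

# e.g. a waterfall f(x) of 2.0 in log-odds space
print(round(logodds_to_prob(2.0), 4))  # 0.8808
```

A margin of 0 corresponds to a probability of 0.5, so positive f(x) values mean the model leans toward that class.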