When I've used XGBoost for regression in the past, I've gotten differentiated predictions, but using an XGBClassifier on this dataset results in every case being predicted to have the same value. In the test data, 221 cases are truly 0 and 49 cases are truly 1. XGBoost seems to be latching onto that imbalance and predicting all 0's. I'm trying to figure out which of the model's parameters I might need to adjust to fix that.
Here is the code I'm running:
import pyreadstat
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Get data
dfloc = r"C:\Users\me\Desktop\Python practice\GBM_data.sav"
df, meta = pyreadstat.read_sav(dfloc, metadataonly=False)
# Filter data
df = df.dropna(subset=["Q31ar1"])
df = df.query("hgroup2==3")
IVs = ["Q35r1", "Q35r2", "Q35r3", "Q35r4", "Q35r5", "Q35r6", "Q35r7", "Q35r8", "Q35r9", "Q35r10", "Q35r11", "Q35r13", "Q35r14", "Q35r15", "Q35r16"]
# Separate samples
train, test = train_test_split(df, test_size=0.3, random_state=410)
train_features = train[IVs]
train_labels = train["Q31ar1"]
train_weight = train["WeightStack"]
test_features = test[IVs]
test_labels = test["Q31ar1"]
test_weight = test["WeightStack"]
# Set up model & params
model = XGBClassifier(objective = 'binary:logistic',
n_estimators = 1000,
learning_rate = .005,
subsample = .5,
max_depth = 4,
min_child_weight = 10,
tree_method = 'hist',
colsample_bytree = .5,
random_state = 410)
# Model
model.fit(train_features, train_labels, sample_weight = train_weight)
test_pred = model.predict(test_features)
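To illustrate the collapse, here's a quick diagnostic (a sketch, not part of the original script; it assumes pandas is available) comparing the predicted class counts to the true counts:

import pandas as pd

# Diagnostic sketch: test_pred comes out as all 0's, while
# test_labels contains 221 zeros and 49 ones.
print(pd.Series(test_pred).value_counts())
print(test_labels.value_counts())
print(accuracy_score(test_labels, test_pred, sample_weight=test_weight))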
Looking through some related questions, it seems like some people have had trouble with their models not going through enough boosting iterations. I'm running through 1000, which has been sufficient for regression in the past. Others were not setting the parameters correctly, but when I run model.get_params(), mine do appear to have been set; here's the output:
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 1,
'colsample_bynode': 1,
'colsample_bytree': 0.5,
'gamma': 0,
'learning_rate': 0.005,
'max_delta_step': 0,
'max_depth': 4,
'min_child_weight': 10,
'missing': None,
'n_estimators': 1000,
'n_jobs': 1,
'nthread': None,
'objective': 'binary:logistic',
'random_state': 410,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.5,
'verbosity': 1,
'tree_method': 'hist'}
Others have had issues with scaling. My predictors are all already scaled the same way: they're ordinal rating scales with values 1, 2, 3, 4, and 5. Still others have had trouble with NaNs, but I'm filtering my data to remove those.
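A quick sanity check on both points (a sketch, assuming the filtering above has already run):

# No NaNs should remain, and all predictors should be on the 1-5 scale
print(train_features.isna().sum().sum(), test_features.isna().sum().sum())  # expect 0 and 0
print(train_features.min().min(), train_features.max().max())  # expect 1.0 and 5.0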
I'm wondering whether I might need a different tree method, or whether I should adjust the base_score parameter.
EDIT: Per Dan's comments, I tried a few things:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Baseline: a plain logistic regression for comparison
clf = LogisticRegression(random_state=0).fit(train_features, train_labels)
test_pred_log = clf.predict(test_features)
accuracy_log = clf.score(test_features, test_labels)

# ROC curve on the training predictions
train_pred = model.predict(train_features)
fpr, tpr, thresholds = roc_curve(train_labels, train_pred, pos_label=1)

# Class-membership probability estimates from the XGBoost model
pred_proba = model.predict_proba(train_features)
My probability estimates are differentiated, so that's great! The probabilities for belonging to class 1 are just all lower, averaging around 20%, which makes sense, since about 20% of the sample truly is in class 1. The problem is that I don't know how to adjust the threshold on the predictions. I suppose I could do it manually using the results from pred_proba, but is there a way to work that into the estimator instead?
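The manual version is simple enough; here's a sketch (the 0.2 cutoff is just an illustration matching the base rate above, not a tuned value):

import numpy as np

# predict_proba returns [P(class 0), P(class 1)] per row; take column 1.
# Lowering the cutoff from the implicit 0.5 to ~0.2 (the base rate here)
# lets cases with modest probabilities still be labeled 1.
proba_test = model.predict_proba(test_features)[:, 1]
test_pred_custom = np.where(proba_test >= 0.2, 1, 0)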
Found an answer on the stats section: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets
scale_pos_weight seems to be a parameter you can adjust to deal with class imbalances like this. Mine was set to the default, 1, which means that negative (0) and positive (1) cases are assumed to show up evenly. If I change this to 4, which is my ratio of negatives to positives, I start seeing cases predicted into class 1.
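Here's a sketch of that change, computing the ratio from the training labels instead of hard-coding it:

# scale_pos_weight = (count of negative cases) / (count of positive cases)
ratio = float((train_labels == 0).sum()) / float((train_labels == 1).sum())  # roughly 4 here
model.set_params(scale_pos_weight=ratio)
model.fit(train_features, train_labels, sample_weight=train_weight)
test_pred = model.predict(test_features)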
My accuracy score goes down, but this makes sense: with this data, you get higher accuracy by predicting everyone to be 0, since the vast majority of cases are 0. But I'm running this model not for accuracy but for information on the importance/contribution of each predictor, so I want differentiated predictions.
One answer in the link also suggested being more conservative by setting scale_pos_weight to the square root of the ratio, which would be 2 in this case. I got higher accuracy with 2 than with 4, so that's what I'm going with, and I plan to look into this parameter in future classification models.
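A small loop makes the comparison explicit (an illustrative sketch; it refits the model once per candidate value):

# Compare the default (1), the sqrt of the ratio (~2), and the full ratio (~4)
for spw in [1, 2, 4]:
    model.set_params(scale_pos_weight=spw)
    model.fit(train_features, train_labels, sample_weight=train_weight)
    acc = accuracy_score(test_labels, model.predict(test_features), sample_weight=test_weight)
    print(spw, acc)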
For a multi-class model, it looks like you're better off adjusting the case-level weights to bring your classes to even representation, as outlined here: https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost
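A sketch of that approach using scikit-learn's helper; multiplying by the existing survey weights is my own assumption about how the two weightings would combine:

from sklearn.utils.class_weight import compute_sample_weight

# 'balanced' weights each case inversely to its class frequency,
# so every class contributes equally to the loss overall.
balance_weight = compute_sample_weight(class_weight="balanced", y=train_labels)
model.fit(train_features, train_labels, sample_weight=train_weight * balance_weight)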