I am trying to compute prediction intervals for a classifier I trained in sklearn.
Even after setting a new random_state parameter in my pipeline, the results
don't change when I refit on the data. What can I do about this?
This is the relevant snippet of the code I'm using:
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline

SEED_VALUE = 3
# `preprocessor`, `train` and `test` are defined elsewhere
t_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lgbm', LGBMClassifier(class_weight="balanced",
                            random_state=SEED_VALUE, max_depth=20,
                            min_child_samples=20, num_leaves=31)),
])
states = [0, 1, 2, 3]
for state in states:
    train_temp = train.copy()
    t_clf.set_params(lgbm__random_state=state)
    t_clf.fit(train_temp, train_temp['label'])
    t_clf.predict_proba(test)
    # output from predict probability doesn't change with varying states
The same happens when I change the shuffle order or the bagging seed.
Here are my current parameters, in case they are helpful:
LGBMClassifier(bagging_seed=2, boosting_type='gbdt', class_weight='balanced',
               colsample_bytree=1.0, importance_type='split', learning_rate=0.1,
               max_depth=50, min_child_samples=1, min_child_weight=0.001,
               min_data_in_leaf=10, min_split_gain=0.0, n_estimators=100,
               n_jobs=-1, num_leaves=30, objective=None, random_state=1,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)
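You get the same results regardless of the random seed because your model specification performs no random sampling at any stage: subsample=1.0 with subsample_freq=0 means no row subsampling (bagging), and colsample_bytree=1.0 means every tree uses all features. With nothing left to randomize, the seed has no effect. If, for instance, you set colsample_bytree
to a value less than 1, you will see different predicted probabilities for different random seeds.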
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier

# generate some data
X, y = make_classification(n_samples=1000, n_features=50, random_state=100)

# set the random state
for state in [0, 1, 2, 3]:

    # instantiate the classifier
    clf = LGBMClassifier(
        class_weight='balanced',
        max_depth=20,
        min_child_samples=20,
        num_leaves=31,
        random_state=state,
        colsample_bytree=0.1,
    )

    # fit the classifier
    clf.fit(X, y)

    # predict the class probabilities
    y_pred = clf.predict_proba(X)

    # print the predicted probability of the
    # first class for the first sample
    print([state, format(y_pred[0, 0], '.4%')])

# [0, '97.8132%']
# [1, '97.4980%']
# [2, '98.3729%']
# [3, '98.0737%']