Tags: python, machine-learning, scikit-learn, lightgbm

LGBM not varying predictions with random state


I am trying to compute prediction intervals for a classifier I trained in sklearn. Even after setting a new random_state parameter in my pipeline, the results don't change when I refit on the data. What can I do about this?

This is a relevant snippet of the code I'm using:

from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier

SEED_VALUE = 3

# preprocessor, train and test are defined earlier in the script
t_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lgbm', LGBMClassifier(class_weight='balanced', random_state=SEED_VALUE,
                            max_depth=20, min_child_samples=20, num_leaves=31)),
])

states = [0, 1, 2, 3]

for state in states:
    train_temp = train.copy()
    t_clf.set_params(lgbm__random_state=state)
    t_clf.fit(train_temp, train_temp['label'])
    t_clf.predict_proba(test)

# output of predict_proba doesn't change with varying states

The same thing happens when I try to change the shuffle order or the bagging seed.

Here are my current parameters, in case that helps:

LGBMClassifier(bagging_seed=2, boosting_type='gbdt', class_weight='balanced',
               colsample_bytree=1.0, importance_type='split', learning_rate=0.1,
               max_depth=50, min_child_samples=1, min_child_weight=0.001,
               min_data_in_leaf=10, min_split_gain=0.0, n_estimators=100,
               n_jobs=-1, num_leaves=30, objective=None, random_state=1,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

Solution

  • You get the same results regardless of the random seed because, with your model specification, no random sampling is performed at any stage: random_state only has an effect when some stochastic component is enabled, such as feature subsampling (colsample_bytree < 1) or bagging (subsample < 1 together with subsample_freq >= 1). If, for instance, you set colsample_bytree to a value less than 1, you will see different predicted probabilities for different random seeds.

    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier

    # generate some data
    X, y = make_classification(n_samples=1000, n_features=50, random_state=100)

    # refit the model with a different random state each time
    for state in [0, 1, 2, 3]:

        # instantiate the classifier; colsample_bytree < 1 means the
        # features for each tree are sampled at random, so the fit
        # now actually depends on the seed
        clf = LGBMClassifier(
            class_weight='balanced',
            max_depth=20,
            min_child_samples=20,
            num_leaves=31,
            random_state=state,
            colsample_bytree=0.1,
        )

        # fit the classifier
        clf.fit(X, y)

        # predict the class probabilities
        y_pred = clf.predict_proba(X)

        # print the predicted probability of the
        # first class for the first sample
        print([state, format(y_pred[0, 0], '.4%')])

    # [0, '97.8132%']
    # [1, '97.4980%']
    # [2, '98.3729%']
    # [3, '98.0737%']
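
  • The same logic applies to the bagging seed: in the sklearn API, subsample (the bagging fraction) only takes effect when subsample_freq is set to a positive value, and your parameters show subsample=1.0 and subsample_freq=0, so no bagging is performed at all. Below is a minimal sketch along the same lines (reusing the synthetic data from above) that enables bagging instead of feature subsampling; with it, the predicted probabilities should again vary with the seed.

    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier

    # generate some data
    X, y = make_classification(n_samples=1000, n_features=50, random_state=100)

    for state in [0, 1, 2, 3]:

        # subsample < 1 alone is not enough: bagging only runs
        # when subsample_freq is also set to a positive value
        clf = LGBMClassifier(
            class_weight='balanced',
            max_depth=20,
            min_child_samples=20,
            num_leaves=31,
            subsample=0.5,
            subsample_freq=1,
            random_state=state,
        )

        # fit and print the predicted probability of the
        # first class for the first sample
        clf.fit(X, y)
        y_pred = clf.predict_proba(X)
        print([state, format(y_pred[0, 0], '.4%')])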