Tags: python, machine-learning, scikit-learn, lightgbm

LGBM not varying predictions with random state


I am trying to compute prediction intervals for a classifier I trained in sklearn. Even after setting a new random_state parameter in my pipeline, the results don't change when I refit on the data. What can I do about this?

This is a relevant snippet of the code I'm using:

from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier

SEED_VALUE = 3

# preprocessor, train and test are defined earlier in the script
t_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lgbm', LGBMClassifier(class_weight='balanced', random_state=SEED_VALUE,
                            max_depth=20, min_child_samples=20, num_leaves=31)),
])

states = [0, 1, 2, 3]

for state in states:
    train_temp = train.copy()
    t_clf.set_params(lgbm__random_state=state)
    t_clf.fit(train_temp, train_temp['label'])
    t_clf.predict_proba(test)

# output of predict_proba doesn't change with varying states

The same thing happens when I try to change the shuffle order or the bagging seed.

Here are my current parameters, in case that helps:

LGBMClassifier(bagging_seed=2, boosting_type='gbdt', class_weight='balanced',
               colsample_bytree=1.0, importance_type='split', learning_rate=0.1,
               max_depth=50, min_child_samples=1, min_child_weight=0.001,
               min_data_in_leaf=10, min_split_gain=0.0, n_estimators=100,
               n_jobs=-1, num_leaves=30, objective=None, random_state=1,
               reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

Solution

  • You get the same results regardless of the random seed because, with your model specification, no random sampling is performed at any stage: random_state only has an effect when some stochastic component is enabled, such as feature subsampling (colsample_bytree < 1) or bagging (subsample < 1 together with subsample_freq >= 1). If, for instance, you set colsample_bytree to a value less than 1, you will see different predicted probabilities for different random seeds.

    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier

    # generate some data
    X, y = make_classification(n_samples=1000, n_features=50, random_state=100)

    # refit the model with a different random state each time
    for state in [0, 1, 2, 3]:

        # instantiate the classifier; colsample_bytree < 1 means the
        # features for each tree are sampled at random, so the fit
        # now actually depends on the seed
        clf = LGBMClassifier(
            class_weight='balanced',
            max_depth=20,
            min_child_samples=20,
            num_leaves=31,
            random_state=state,
            colsample_bytree=0.1,
        )

        # fit the classifier
        clf.fit(X, y)

        # predict the class probabilities
        y_pred = clf.predict_proba(X)

        # print the predicted probability of the
        # first class for the first sample
        print([state, format(y_pred[0, 0], '.4%')])

    # [0, '97.8132%']
    # [1, '97.4980%']
    # [2, '98.3729%']
    # [3, '98.0737%']
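
  • The same logic applies to the bagging seed: in the sklearn API, subsample (the bagging fraction) only takes effect when subsample_freq is set to a positive value, and your parameters show subsample=1.0 and subsample_freq=0, so no bagging is performed at all. Below is a minimal sketch along the same lines (reusing the synthetic data from above) that enables bagging instead of feature subsampling; with it, the predicted probabilities should again vary with the seed.

    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier

    # generate some data
    X, y = make_classification(n_samples=1000, n_features=50, random_state=100)

    for state in [0, 1, 2, 3]:

        # subsample < 1 alone is not enough: bagging only runs
        # when subsample_freq is also set to a positive value
        clf = LGBMClassifier(
            class_weight='balanced',
            max_depth=20,
            min_child_samples=20,
            num_leaves=31,
            subsample=0.5,
            subsample_freq=1,
            random_state=state,
        )

        # fit and print the predicted probability of the
        # first class for the first sample
        clf.fit(X, y)
        y_pred = clf.predict_proba(X)
        print([state, format(y_pred[0, 0], '.4%')])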