Fitting Ensemble Regressor within a loop generates repeat values

I'm trying to use an ensemble regressor to predict production based on a couple of material measurements. My data is annual, going back to 1965. (Some details stripped out and random data used because this is for a work project using sensitive data.)

I've stripped my code down to the bare minimum and I'm still seeing the issue:

import pandas as pd
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost.sklearn import XGBRegressor

X_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    A = np.random.randint(4170, 19091, size = 55),
    B = np.random.randint(74, 337, size = 55)
))

X_future = pd.DataFrame(index = range(2020, 2023), data = dict(
    A = np.random.randint(4170, 19091, size = 3),
    B = np.random.randint(74, 337, size = 3)
))

y_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    C = np.random.randint(12163, 42580, size = 55)
))

# Collect one row of predictions per iteration
rows = []

for _ in range(10):
    X = X_past.values
    y = y_past.values.ravel()

    # A fresh regressor is created and fitted on every iteration
    #reg = RandomForestRegressor(n_estimators = 300)
    reg = GradientBoostingRegressor(n_estimators = 300)
    #reg = XGBRegressor(n_estimators = 640, silent = True)

    reg.fit(X, y)

    y_pred = reg.predict(X_future.values)
    rows.append(y_pred)

# DataFrame.append was removed in pandas 2.0, so build the frame once at the end
predictions = pd.DataFrame(rows, columns = [2020, 2021, 2022])
predictions['Row-wise Duplicates'] = (predictions[2021] == predictions[2022])

predictions

That produces results such as:

2020	2021	2022	Row-wise Duplicates
13211.008045	29624.483861	34110.523735	False
13211.008045	29624.483861	33462.196606	False
13211.008045	29624.483861	33867.781932	False
13211.008045	29624.483861	33999.203849	False
13211.008045	29624.483861	33947.950436	False
13211.008045	29624.483861	33550.338744	False
13211.008045	29624.483861	34079.297200	False
13211.008045	29624.483861	33924.349324	False
13211.008045	29624.483861	33195.847833	False
13211.008045	29624.483861	33922.391200	False

As you can see, despite fitting anew on each iteration, I'm seeing a lot of repeat values.

I also sometimes see duplication of values across the years (usually 2021 matching 2022, which is why I calculate the Row-wise Duplicates column):

2020	2021	2022	Row-wise Duplicates
40819.929316	40819.929316	40819.929316	True
41516.312213	41516.312213	41516.312213	True
41516.312213	41516.312213	41516.312213	True
40901.743937	40901.743937	40901.743937	True
41191.025907	41191.025907	41191.025907	True
41109.211286	41109.211286	41109.211286	True
40910.834451	40910.834451	40910.834451	True
41799.581630	41799.581630	41799.581630	True
42512.531092	42512.531092	42512.531092	True
41018.306151	41018.306151	41018.306151	True

What am I doing wrong? Why am I seeing duplicates like this? And how can I fix it?

Solution

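With the parameters you use, GradientBoostingRegressor has almost no internal randomness: subsample defaults to 1.0, so every base learner is trained on the full training set. Giving it the same training set and the same test set on every iteration, as your code does, will therefore produce the same (or nearly the same) results. The only remaining randomness is that the features are randomly permuted at each split, so a fit can differ slightly when two candidate splits are equally good, which would explain why only one of your columns varies.

You can set the subsample parameter to a value smaller than 1 so that each base learner is trained on a different random sub-sample (see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html).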

So, if you replace your line with this one:

reg = GradientBoostingRegressor(n_estimators = 300, subsample = 0.9)

The algorithm will use a random subset of 90% of your data to train each learner, and you will get different results in each call. You can still make the results reproducible if you combine it with the random_state parameter.
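
As a rough illustration of the last point, here is a minimal sketch that compares repeated fits with and without random_state. The data is random and only mimics the shape of the data in the question. The helper name fit_predict and the seed value 42 are made up for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Random stand-in data with the same shape as in the question (assumption)
rng = np.random.default_rng(0)
X = rng.integers(4170, 19091, size=(55, 2)).astype(float)
y = rng.integers(12163, 42580, size=55).astype(float)
X_future = rng.integers(4170, 19091, size=(3, 2)).astype(float)


def fit_predict(random_state=None):
    """Hypothetical helper: fit a fresh model and predict the future rows."""
    reg = GradientBoostingRegressor(
        n_estimators=300,
        subsample=0.9,              # each tree sees a random 90% of the rows
        random_state=random_state,  # None: a different sub-sample on every fit
    )
    reg.fit(X, y)
    return reg.predict(X_future)


# Without a seed, repeated fits draw different sub-samples
unseeded = [fit_predict() for _ in range(3)]

# With a fixed seed, every fit draws the same sub-samples
seeded = [fit_predict(random_state=42) for _ in range(3)]

print(np.vstack(unseeded))
print(np.vstack(seeded))
```

The unseeded rows should differ from one another, while the seeded rows should be identical. If you want varied but reproducible runs, one option is to pass a different fixed seed on each iteration, for example the loop counter.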