
Fitting Ensemble Regressor within a loop generates repeat values


I'm trying to use an ensemble regressor to predict production based on a couple of material measurements. My data is annual, going back to 1965. (Some details stripped out and random data used because this is for a work project using sensitive data.)

I've stripped my code down to the bare minimum and I'm still seeing the issue:

import pandas as pd
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost.sklearn import XGBRegressor

X_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    A = np.random.randint(4170, 19091, size = 55),
    B = np.random.randint(74, 337, size = 55)
))

X_future = pd.DataFrame(index = range(2020, 2023), data = dict(
    A = np.random.randint(4170, 19091, size = 3),
    B = np.random.randint(74, 337, size = 3)
))

y_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    C = np.random.randint(12163, 42580, size = 55)
))

predictions = None
predictions = pd.DataFrame()

i = 0

while i < 10:
    i += 1
    
    reg = None
    y_pred = None
    
    X = X_past.values
    y = y_past.values.ravel()

    #reg = RandomForestRegressor(n_estimators = 300)
    reg = GradientBoostingRegressor(n_estimators = 300)
    #reg = XGBRegressor(n_estimators = 640, silent = True)

    reg.fit(X, y)

    y_pred = reg.predict(np.array(X_future))
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
    predictions = pd.concat([predictions, pd.DataFrame([y_pred])], ignore_index = True)
    

predictions.columns = [2020, 2021, 2022]
predictions['Row-wise Duplicates'] = (predictions[2021] == predictions[2022])

predictions

That produces results such as:

2020 2021 2022 Row-wise Duplicates
13211.008045 29624.483861 34110.523735 False
13211.008045 29624.483861 33462.196606 False
13211.008045 29624.483861 33867.781932 False
13211.008045 29624.483861 33999.203849 False
13211.008045 29624.483861 33947.950436 False
13211.008045 29624.483861 33550.338744 False
13211.008045 29624.483861 34079.297200 False
13211.008045 29624.483861 33924.349324 False
13211.008045 29624.483861 33195.847833 False
13211.008045 29624.483861 33922.391200 False

As you can see, despite fitting anew on each iteration, I'm seeing a lot of repeat values.

I also sometimes see duplication of values across the years (usually 2021 matching 2022, which is why I calculate the Row-wise Duplicates column):

2020 2021 2022 Row-wise Duplicates
40819.929316 40819.929316 40819.929316 True
41516.312213 41516.312213 41516.312213 True
41516.312213 41516.312213 41516.312213 True
40901.743937 40901.743937 40901.743937 True
41191.025907 41191.025907 41191.025907 True
41109.211286 41109.211286 41109.211286 True
40910.834451 40910.834451 40910.834451 True
41799.581630 41799.581630 41799.581630 True
42512.531092 42512.531092 42512.531092 True
41018.306151 41018.306151 41018.306151 True

What am I doing wrong? Why am I seeing duplicates like this? And how can I fix it?


Solution

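  • With the parameters you use, the algorithm is essentially deterministic: with the default subsample = 1.0, every base learner is trained on the full training set. So giving it the same training set and the same test set (as you do in your code) will produce the same results on each fit. The small differences you do see can arise when several candidate splits are tied, because scikit-learn randomly permutes the features at each split.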

    You can set the subsample parameter to a value smaller than 1 so that each base learner is trained on a different random subsample of the data (see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html).

    So, if you replace your line with this one:

    reg = GradientBoostingRegressor(n_estimators = 300, subsample = 0.9)
    

    The algorithm will use a random subset of 90% of your data to train each learner, and you will get different results in each call. You can still make the results reproducible if you combine it with the random_state parameter.
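As a minimal sketch of combining the two parameters, the snippet below fits the regressor twice with the same random_state and once with a different one. The random data and the helper name fit_predict are made up for illustration and are not part of the original code; only the subsample and random_state arguments come from the answer above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data with the same shapes as in the question
# (55 past years, 3 future years, two features A and B).
rng = np.random.default_rng(0)
X_past = np.column_stack([
    rng.integers(4170, 19091, size=55),
    rng.integers(74, 337, size=55),
]).astype(float)
y_past = rng.integers(12163, 42580, size=55).astype(float)
X_future = np.column_stack([
    rng.integers(4170, 19091, size=3),
    rng.integers(74, 337, size=3),
]).astype(float)


def fit_predict(seed):
    """Hypothetical helper: fit a fresh model and predict the future rows."""
    reg = GradientBoostingRegressor(
        n_estimators=300,
        subsample=0.9,      # each tree is fit on a random 90% of the rows
        random_state=seed,  # fixes which rows are drawn, so runs are repeatable
    )
    reg.fit(X_past, y_past)
    return reg.predict(X_future)


same_a = fit_predict(0)
same_b = fit_predict(0)     # same seed as same_a
different = fit_predict(1)  # different seed

print(np.array_equal(same_a, same_b))     # identical seeds
print(np.array_equal(same_a, different))  # different seeds
```

Passing a different seed on each loop iteration (for example the loop counter) should give predictions that vary between fits while staying repeatable from one run of the script to the next.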