I'm trying to use an ensemble regressor to predict production based on a couple of material measurements. My data is annual, going back to 1965. (Some details stripped out and random data used because this is for a work project using sensitive data.)
I've stripped my code down to the bare minimum and I'm still seeing the issue:
```python
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost.sklearn import XGBRegressor

X_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    A = np.random.randint(4170, 19091, size = 55),
    B = np.random.randint(74, 337, size = 55)
))
X_future = pd.DataFrame(index = range(2020, 2023), data = dict(
    A = np.random.randint(4170, 19091, size = 3),
    B = np.random.randint(74, 337, size = 3)
))
y_past = pd.DataFrame(index = range(1965, 2020), data = dict(
    C = np.random.randint(12163, 42580, size = 55)
))

predictions = None
predictions = pd.DataFrame()
i = 0
while i < 10:
    i += 1
    reg = None
    y_pred = None
    X = X_past.values
    y = y_past.values.ravel()
    #reg = RandomForestRegressor(n_estimators = 300)
    reg = GradientBoostingRegressor(n_estimators = 300)
    #reg = XGBRegressor(n_estimators = 640, silent = True)
    reg.fit(X, y)
    y_pred = reg.predict(np.array(X_future))
    predictions = predictions.append(pd.Series(y_pred), ignore_index = True)

predictions.columns = [2020, 2021, 2022]
predictions['Row-wise Duplicates'] = (predictions[2021] == predictions[2022])
predictions
```
That produces results such as:
| 2020 | 2021 | 2022 | Row-wise Duplicates |
|---|---|---|---|
| 13211.008045 | 29624.483861 | 34110.523735 | False |
| 13211.008045 | 29624.483861 | 33462.196606 | False |
| 13211.008045 | 29624.483861 | 33867.781932 | False |
| 13211.008045 | 29624.483861 | 33999.203849 | False |
| 13211.008045 | 29624.483861 | 33947.950436 | False |
| 13211.008045 | 29624.483861 | 33550.338744 | False |
| 13211.008045 | 29624.483861 | 34079.297200 | False |
| 13211.008045 | 29624.483861 | 33924.349324 | False |
| 13211.008045 | 29624.483861 | 33195.847833 | False |
| 13211.008045 | 29624.483861 | 33922.391200 | False |
As you can see, despite fitting anew on each iteration, I'm seeing a lot of repeat values.
I also sometimes see duplication of values across the years (usually 2021 matching 2022, which is why I calculate the Row-wise Duplicates column):
| 2020 | 2021 | 2022 | Row-wise Duplicates |
|---|---|---|---|
| 40819.929316 | 40819.929316 | 40819.929316 | True |
| 41516.312213 | 41516.312213 | 41516.312213 | True |
| 41516.312213 | 41516.312213 | 41516.312213 | True |
| 40901.743937 | 40901.743937 | 40901.743937 | True |
| 41191.025907 | 41191.025907 | 41191.025907 | True |
| 41109.211286 | 41109.211286 | 41109.211286 | True |
| 40910.834451 | 40910.834451 | 40910.834451 | True |
| 41799.581630 | 41799.581630 | 41799.581630 | True |
| 42512.531092 | 42512.531092 | 42512.531092 | True |
| 41018.306151 | 41018.306151 | 41018.306151 | True |
What am I doing wrong? Why am I seeing duplicates like this? And how can I fix it?
The algorithm you are using, with the parameters you are using, has no internal source of randomness, so fitting it on the same training set and predicting on the same test set (which is exactly what your code does on every iteration) will produce the same results each time.
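To see this concretely, here is a minimal sketch (on made-up toy data, not the data from the question): two independent fits on identical inputs give identical predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for the question's X_past / X_future / y_past.
rng = np.random.default_rng(0)
X = rng.uniform(size=(55, 2))
y = rng.uniform(size=55)
X_future = rng.uniform(size=(3, 2))

# Two independent fits on identical data...
pred_1 = GradientBoostingRegressor(n_estimators=300).fit(X, y).predict(X_future)
pred_2 = GradientBoostingRegressor(n_estimators=300).fit(X, y).predict(X_future)

# ...give identical predictions, because nothing in the default
# configuration of GradientBoostingRegressor is randomized.
print(np.array_equal(pred_1, pred_2))  # True
```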
You can use the `subsample` parameter with a value smaller than 1 to make it train each base learner on a different random sub-sample of the data (see the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html).
So, if you replace that line with this one:

```python
reg = GradientBoostingRegressor(n_estimators = 300, subsample = 0.9)
```
The algorithm will use a random subset of 90% of your data to train each base learner, and you will get different results on each call. You can still make the results reproducible by combining it with the `random_state` parameter.
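As a rough sketch of that suggestion (again on made-up toy data): `subsample=0.9` alone makes repeated fits disagree, while adding `random_state` pins the sub-sampling so each fit is reproducible. In the loop from the question you could, for example, pass `random_state=i` to get predictions that vary per iteration but are repeatable across runs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for the question's X_past / X_future / y_past.
rng = np.random.default_rng(0)
X = rng.uniform(size=(55, 2))
y = rng.uniform(size=55)
X_future = rng.uniform(size=(3, 2))

# subsample < 1 without a fixed random_state: each fit draws different
# 90% sub-samples for its trees, so repeated fits disagree.
varied = [
    GradientBoostingRegressor(n_estimators=300, subsample=0.9).fit(X, y).predict(X_future)
    for _ in range(2)
]
print(np.array_equal(varied[0], varied[1]))  # almost certainly False

# Same settings plus a fixed random_state: the sub-sampling is seeded,
# so every fit (and its predictions) is reproducible.
fixed = [
    GradientBoostingRegressor(n_estimators=300, subsample=0.9, random_state=42).fit(X, y).predict(X_future)
    for _ in range(2)
]
print(np.array_equal(fixed[0], fixed[1]))  # True
```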