I am trying to predict a variable (Y) using LightGBM regression. However, my predicted values are all the same (i.e. constant). Can someone help me identify the problem?
import lightgbm as lgm
import pandas as pd
from sklearn.model_selection import train_test_split
data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]
df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])
data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]
df_y = pd.DataFrame(data_y, columns=['Value'])
X_df_earn_ind_fin_train, X_df_earn_ind_fin_test, y_df_earn_ind_fin_train, y_df_earn_ind_fin_test = train_test_split(df_x, df_y, test_size=0.3, random_state=21)
hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['mape', 'auc'],
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    'verbose_eval': -1,
    "max_depth": 10,
    "num_leaves": 96,
    "max_bin": 256,
    "num_iterations": 1000,
    "n_estimators": 250
}
gbm = lgm.LGBMRegressor(**hyper_params)
gbm.fit(X_df_earn_ind_fin_train, y_df_earn_ind_fin_train,
        eval_set=[(X_df_earn_ind_fin_test, y_df_earn_ind_fin_test)],
        eval_metric='mape')
y_pred_df_earn_ind_test = gbm.predict(X_df_earn_ind_fin_test)
However, my output is just an array of one repeated constant value:
y_pred_df_earn_ind_test =
array([1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
1497.21170863, 1497.21170863])
How do I rectify this?
When the training data contains fewer than 200 rows, use the following parameters:
min_data_in_leaf = 1
min_data_in_bin = 1
LightGBM has a few important parameters that protect against overfitting, and their default values assume you have at least a few hundred samples.
- min_data_in_leaf: minimum number of samples that must fall into a leaf node (default = 20)
- min_data_in_bin: minimum number of samples to group together into one histogram "bin" when LightGBM discretizes features (default = 3)

For more details on that, see "Why R2 score is zero in LightGBM?" and "Why does this simple LightGBM classifier perform poorly?".
For a very small dataset like the one in your example (41 training rows, 3 columns), those default values are very limiting and can result in few or no splits being added per tree. If no splits are added at all, every tree predicts a single value, so the model's output is constant, which is exactly what you observed.
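You can check whether that is what happened with your original parameters by inspecting the trees the fitted model actually built. A minimal sketch, run against the gbm from your snippet; it assumes a lightgbm version recent enough to provide Booster.trees_to_dataframe():

# Inspect the structure of the fitted model's trees.
# If every tree is a single leaf (no split nodes), the model can only ever
# return one value, which is why every prediction is identical.
tree_df = gbm.booster_.trees_to_dataframe()
n_splits = tree_df["split_feature"].notna().sum()
print("total nodes:", len(tree_df))
print("split nodes:", n_splits)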
Consider the following example using exactly the data you provided, with Python 3.11, lightgbm==4.3.0, pandas==2.2.1, and scikit-learn==1.4.1.
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split
data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]
df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])
data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]
df_y = pd.DataFrame(data_y, columns=['Value'])
X_train, X_test, y_train, y_test = train_test_split(
    df_x,
    df_y,
    test_size=0.3,
    random_state=21
)
params = {
    "num_iterations": 10,
    "objective": "regression",
    "min_data_in_leaf": 1,
    "min_data_in_bin": 1,
    "verbose": 0,
}
# train
gbm = lgb.LGBMRegressor(**params)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='mape')
# predict
preds = gbm.predict(X_test)
print(preds)
That produces predictions with some variation.
[1514.86588126 1557.1389268 1423.54076682 1514.86588126 1488.24836945
1541.52116271 1555.63537413 1393.69927646 1404.48244093 1465.1569698
1404.48244093 1404.48244093 1514.86588126 1440.95713788 1535.84165832
1482.58308126 1471.96999117 1504.50006758]
And the following scores on the test set:
from sklearn.metrics import mean_absolute_error, r2_score
mean_absolute_error(y_test, preds)
# 45.212
r2_score(y_test, preds)
# 0.47
Some other notes related to the original question:
- num_iterations and n_estimators are aliases for each other... they mean exactly the same thing. Just use one of them. (LightGBM docs)
- "auc" is a classification metric... it isn't appropriate for regression problems (LightGBM docs)
- "task" is only for the LightGBM CLI. It doesn't affect the Python package at all. Omit it. (LightGBM docs)
- when using the scikit-learn estimators for LightGBM, omit metric from params and just pass the eval_metric keyword argument to .fit(), as shown in the sketch below
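Putting those notes together, a cleaned-up version of your original setup might look something like the following. This is only a sketch: the parameter values are illustrative (carried over or assumed, not tuned), and it reuses the X_train / X_test / y_train / y_test split from the example above.

import lightgbm as lgb

# A tidied-up configuration reflecting the notes above (values are illustrative):
# - no 'task' key (CLI-only)
# - no 'metric' key and no 'auc' (a regression eval_metric is passed to .fit() instead)
# - only one of num_iterations / n_estimators
# - min_data_in_leaf / min_data_in_bin lowered for a very small dataset
params_clean = {
    "objective": "regression",
    "boosting_type": "gbdt",
    "learning_rate": 0.01,
    "n_estimators": 250,
    "min_data_in_leaf": 1,
    "min_data_in_bin": 1,
    "verbose": 0,
}

gbm = lgb.LGBMRegressor(**params_clean)
gbm.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="mape",
)
preds = gbm.predict(X_test)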