python, python-3.x, machine-learning, lightgbm, boosting

Constant predicted values in LightGBM


I am trying to predict a variable (Y) using LightGBM regression, but my predicted values are all the same (i.e. constant). Can someone help me find the problem?

import lightgbm as lgm
import pandas as pd
from sklearn.model_selection import train_test_split

data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]

df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])

data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]

df_y = pd.DataFrame(data_y, columns=['Value'])

X_df_earn_ind_fin_train, X_df_earn_ind_fin_test, y_df_earn_ind_fin_train, y_df_earn_ind_fin_test = train_test_split(df_x, df_y, test_size=0.3, random_state=21)

hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['mape', 'auc'],
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    'verbose_eval': -1,
    "max_depth": 10,
    "num_leaves": 96,  
    "max_bin": 256,
    "num_iterations": 1000,
    "n_estimators": 250
}

gbm = lgm.LGBMRegressor(**hyper_params)
gbm.fit(X_df_earn_ind_fin_train, y_df_earn_ind_fin_train,
        eval_set=[(X_df_earn_ind_fin_test, y_df_earn_ind_fin_test)],
        eval_metric='mape')

y_pred_df_earn_ind_test = gbm.predict(X_df_earn_ind_fin_test)

However, my output is just an array of a single constant value:

y_pred_df_earn_ind_test = 
array([1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863])

How do I rectify this?


Solution

  • Short Answer

    When the training data contains fewer than about 200 rows, use the following parameters:

    • min_data_in_leaf = 1
    • min_data_in_bin = 1

    Details

    LightGBM has a few important parameters that guard against overfitting, and their default values assume you have at least a few hundred samples.

    • min_data_in_leaf: minimum number of samples that must fall into a leaf node (default = 20)
    • min_data_in_bin: minimum number of samples to group together into one histogram "bin" when LightGBM discretizes features (default = 3)

    For more details on that, see "Why R2 score is zero in LightGBM?" and "Why does this simple LightGBM classifier perform poorly?".

    For a very small dataset like the one in your example (41 training rows and 3 columns after the 70/30 split), those default values are very restrictive: few or even no splits can be added per tree, so the model can end up predicting a single constant value.
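
    One quick way to confirm this on your own fitted model (a diagnostic sketch, assuming the gbm object from your snippet is still in scope) is to dump the booster and count the leaves in each tree: with the defaults every tree degenerates to a single leaf, and an ensemble of single-leaf trees can only return one number, essentially the mean of the training target.

    import numpy as np

    # Dump the fitted booster and count the leaves in each tree.
    dump = gbm.booster_.dump_model()
    leaves_per_tree = [tree["num_leaves"] for tree in dump["tree_info"]]
    print(set(leaves_per_tree))  # expect {1}: every tree is just a root leaf

    # A stump-only ensemble predicts the same value for every row,
    # which is why the output sits so close to the training-target mean.
    print(np.unique(gbm.predict(X_df_earn_ind_fin_test)))
    print(y_df_earn_ind_fin_train["Value"].mean())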

    Consider the following example using exactly the data you provided, with Python 3.11, lightgbm==4.3.0, pandas==2.2.1, and scikit-learn==1.4.1.

    import lightgbm as lgb
    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]
    
    df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])
    
    data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]
    
    df_y = pd.DataFrame(data_y, columns=['Value'])
    
    X_train, X_test, y_train, y_test = train_test_split(
        df_x,
        df_y,
        test_size=0.3,
        random_state=21
    )
    
    params = {
        "num_iterations": 10,
        "objective": "regression",
        "min_data_in_leaf": 1,
        "min_data_in_bin": 1,
        "verbose": 0,
    }
    
    # train
    gbm = lgb.LGBMRegressor(**params)
    gbm.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            eval_metric='mape')
    
    # predict
    preds = gbm.predict(X_test)
    print(preds)
    

    That produces predictions with some variation.

    [1514.86588126 1557.1389268  1423.54076682 1514.86588126 1488.24836945
     1541.52116271 1555.63537413 1393.69927646 1404.48244093 1465.1569698
     1404.48244093 1404.48244093 1514.86588126 1440.95713788 1535.84165832
     1482.58308126 1471.96999117 1504.50006758]
    

    And the following scores on the test set:

    from sklearn.metrics import mean_absolute_error, r2_score
    
    mean_absolute_error(y_test, preds)
    # 45.212
    
    r2_score(y_test, preds)
    # 0.47
    
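    For reference (a quick sketch reusing the same y_train / y_test split and the metrics imported above), the constant predictor that the original settings effectively produce, i.e. every row receiving the training-target mean, should score noticeably worse on the same test set:

    import numpy as np

    # Baseline: predict the training-target mean for every test row,
    # which is what an ensemble of single-leaf trees amounts to.
    baseline = np.full(len(y_test), y_train["Value"].mean())

    mean_absolute_error(y_test, baseline)
    # expect something well above the ~45.2 from the tuned model

    r2_score(y_test, baseline)
    # at or below 0 by construction, since the baseline ignores the features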

    Some other notes related to the original question:

    • num_iterations and n_estimators are aliases for each other... they mean exactly the same thing. Just use one of them. (LightGBM docs)
    • "auc" is a classification metric... it isn't appropriate for regression problems (LightGBM docs)
    • tas" is only for the LightGBM CLI. It doesn't affect the Python package at all. Omit it. (LightGBM docs)
    • In the scikit-learn estimators for LightGBM, omit metric from params and just pass the eval_metric keyword arguments to .fit()
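
    Putting those notes together, a cleaned-up version of your original setup might look like the following (a sketch only; the parameter values themselves are still yours to tune):

    import lightgbm as lgb

    # Cleaned-up hyperparameters for the scikit-learn API:
    # no 'task' (CLI-only), no 'metric'/'auc' (a classification metric),
    # a single n_estimators instead of two aliases, and the small-data
    # guards relaxed for a ~40-row training set.
    hyper_params = {
        "objective": "regression",
        "boosting_type": "gbdt",
        "learning_rate": 0.01,
        "n_estimators": 1000,
        "max_depth": 10,
        "num_leaves": 96,
        "feature_fraction": 0.9,
        "bagging_fraction": 0.7,
        "bagging_freq": 10,
        "min_data_in_leaf": 1,
        "min_data_in_bin": 1,
        "verbose": -1,
    }

    gbm = lgb.LGBMRegressor(**hyper_params)
    gbm.fit(
        X_df_earn_ind_fin_train,
        y_df_earn_ind_fin_train,
        eval_set=[(X_df_earn_ind_fin_test, y_df_earn_ind_fin_test)],
        eval_metric="mape",  # regression metric passed to fit(), not in params
    )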