python · machine-learning · hyperparameters · catboost

Unable to tune hyperparameters for CatBoostRegressor


I am trying to fit a CatBoostRegressor to my data. When I perform K-fold CV for the baseline model, everything works fine. But when I use Optuna for hyperparameter tuning, it does something really weird: it runs the first trial and then throws the following error:

[I 2021-08-26 08:00:56,865] Trial 0 finished with value: 0.7219653113910736 and parameters: {'model__depth': 2, 'model__iterations': 1715, 'model__subsample': 0.5627211605250965, 'model__learning_rate': 0.15601805222619286}. Best is trial 0 with value: 0.7219653113910736.
[W 2021-08-26 08:00:56,869] Trial 1 failed because of the following error: CatBoostError("You can't change params of fitted model.")
Traceback (most recent call last):

I used a similar approach for XGBRegressor and LGBM and they worked fine. So why am I getting an error for CatBoost?

Below is my code:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Separate categorical and numeric columns by dtype
cat_cols = [cname for cname in train_data1.columns
            if train_data1[cname].dtype == 'object']
num_cols = [cname for cname in train_data1.columns
            if train_data1[cname].dtype in ['int64', 'float64']]

# Numeric: mean imputation + scaling; categorical: mode imputation + one-hot encoding
num_trans = Pipeline(steps=[('impute', SimpleImputer(strategy='mean')),
                            ('scale', StandardScaler())])
cat_trans = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                            ('encode', OneHotEncoder(handle_unknown='ignore'))])

from sklearn.compose import ColumnTransformer

preproc = ColumnTransformer(transformers=[('cat', cat_trans, cat_cols),
                                          ('num', num_trans, num_cols)])


from catboost import CatBoostRegressor

cbr_model = CatBoostRegressor(random_state=69,
                              loss_function='RMSE',
                              eval_metric='RMSE',
                              leaf_estimation_method='Newton',
                              bootstrap_type='Bernoulli',
                              task_type='GPU')

pipe = Pipeline(steps=[('preproc', preproc), ('model', cbr_model)])


import optuna
import numpy as np
from sklearn.metrics import mean_squared_error

def objective(trial):
    # Sample hyperparameters, prefixed with 'model__' to target the pipeline's model step
    model__depth = trial.suggest_int('model__depth', 2, 10)
    model__iterations = trial.suggest_int('model__iterations', 100, 2000)
    model__subsample = trial.suggest_float('model__subsample', 0.0, 1.0)
    model__learning_rate = trial.suggest_float('model__learning_rate',
                                               0.001, 0.3, log=True)

    params = {'model__depth': model__depth,
              'model__iterations': model__iterations,
              'model__subsample': model__subsample,
              'model__learning_rate': model__learning_rate}

    # Reuses the same pipeline (and the same model instance) across trials
    pipe.set_params(**params)
    pipe.fit(train_x, train_y)
    pred = pipe.predict(test_x)

    return np.sqrt(mean_squared_error(test_y, pred))

cbr_study = optuna.create_study(direction='minimize')
cbr_study.optimize(objective, n_trials=10)

Solution

  • Apparently, CatBoost has a mechanism where you have to create a new CatBoost model object for each trial. I opened an issue on GitHub about this, and they said it was implemented to protect the results of a long training, which makes no sense to me!

    As of right now, the only workaround for this issue is that you HAVE to create a new CatBoost model for each and every trial!

    The other, much more sensible way, if you are using a Pipeline with Optuna, is to define the pipeline instance and the model instance inside the Optuna objective function, and then define the final pipeline instance once more outside the function, as sketched below.

    That way you do not have to define 50 instances by hand if you are running 50 trials!!
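
    Below is a minimal sketch of that approach. It assumes the preproc transformer and the train_x, train_y, test_x, test_y splits from the question are already defined. The parameter names drop the model__ prefix because they are passed straight to CatBoost, and the subsample lower bound is raised to 0.1 here because Bernoulli subsampling needs a rate above zero.

    import optuna
    import numpy as np
    from catboost import CatBoostRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import mean_squared_error

    def objective(trial):
        # A fresh CatBoost instance per trial avoids the
        # "You can't change params of fitted model" error
        cbr_model = CatBoostRegressor(
            random_state=69,
            loss_function='RMSE',
            eval_metric='RMSE',
            leaf_estimation_method='Newton',
            bootstrap_type='Bernoulli',
            task_type='GPU',
            depth=trial.suggest_int('depth', 2, 10),
            iterations=trial.suggest_int('iterations', 100, 2000),
            subsample=trial.suggest_float('subsample', 0.1, 1.0),
            learning_rate=trial.suggest_float('learning_rate', 0.001, 0.3,
                                              log=True))

        # The pipeline is also rebuilt here, so every trial fits an unfitted model
        pipe = Pipeline(steps=[('preproc', preproc), ('model', cbr_model)])
        pipe.fit(train_x, train_y)
        pred = pipe.predict(test_x)
        return np.sqrt(mean_squared_error(test_y, pred))

    cbr_study = optuna.create_study(direction='minimize')
    cbr_study.optimize(objective, n_trials=10)

    # Rebuild the final pipeline once, outside the function, with the best params
    final_model = CatBoostRegressor(random_state=69,
                                    loss_function='RMSE',
                                    eval_metric='RMSE',
                                    leaf_estimation_method='Newton',
                                    bootstrap_type='Bernoulli',
                                    task_type='GPU',
                                    **cbr_study.best_params)
    final_pipe = Pipeline(steps=[('preproc', preproc), ('model', final_model)])
    final_pipe.fit(train_x, train_y)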