Tags: machine-learning, data-science, catboost

How to improve the CatBoostRegressor?


I am working on a data science regression problem with around 90,000 rows in the train set and 8,500 in the test set. There are 9 categorical columns and no missing data. For this case, I applied a CatBoostRegressor, which gave me a pretty good R² (98.51) and MAE (3.77). Other models, LGBM and XGBoost, performed below CatBoost.

Now I would like to increase the R² value and decrease the MAE for more accurate results. That is what is demanded, too.

I have tuned many times, adding 'loss_function': ['MAE'], 'l2_leaf_reg': [3], 'random_strength': [4], 'bagging_temperature': [0.5] with different values, but the performance stays the same.

Can anyone help me boost the R² value while minimizing MAE and MSE?


Solution

  • Simple method -

    You can use Scikit-Learn's GridSearchCV to find the best hyperparameters for your CatBoostRegressor model. You pass a dictionary of candidate hyperparameters, and GridSearchCV loops through every combination and reports which parameters work best. You can use it like this -

    from sklearn.model_selection import GridSearchCV
    from catboost import CatBoostRegressor  # import was missing in the original
    
    model = CatBoostRegressor()
    parameters = {'depth'         : [6, 8, 10],
                  'learning_rate' : [0.01, 0.05, 0.1],
                  'iterations'    : [30, 50, 100]
                  }
    
    grid = GridSearchCV(estimator=model, param_grid=parameters, cv=2, n_jobs=-1)
    grid.fit(X_train, y_train)
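
    After the search finishes, `grid.best_params_` holds the winning combination and `grid.best_estimator_` has already been refit on the full training data. A minimal, self-contained sketch of that retrieval step, using scikit-learn's GradientBoostingRegressor and synthetic data as stand-ins so it runs even where CatBoost is not installed:

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Synthetic stand-in data; substitute your own X_train / y_train here
    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = GradientBoostingRegressor(random_state=0)
    parameters = {'max_depth': [2, 3], 'learning_rate': [0.05, 0.1]}

    grid = GridSearchCV(estimator=model, param_grid=parameters, cv=2, n_jobs=-1)
    grid.fit(X_train, y_train)

    print(grid.best_params_)                            # the winning combination
    print(grid.best_estimator_.score(X_test, y_test))   # R² on held-out data
    ```

    The same `best_params_` / `best_estimator_` attributes work identically with a CatBoostRegressor in place of the stand-in model.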
    

    Another method -

    Nowadays, models are complex and have many parameters to tune. People use Bayesian optimization frameworks, like Optuna, to tune hyperparameters. You can use Optuna to tune a CatBoostClassifier like this:

    !pip install optuna
    import catboost
    import numpy as np
    import optuna
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    
    def objective(trial):
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)
    
        param = {
            "objective": trial.suggest_categorical("objective", ["Logloss", "CrossEntropy"]),
            "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3, log = True),
            "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.01, 0.1),
            "max_depth": trial.suggest_int("max_depth", 1, 15),
            "boosting_type": trial.suggest_categorical("boosting_type", ["Ordered", "Plain"]),
            "bootstrap_type": trial.suggest_categorical(
                "bootstrap_type", ["Bayesian", "Bernoulli", "MVS"]),
        }
    
        # These parameters are only valid for their matching bootstrap type
        if param["bootstrap_type"] == "Bayesian":
            param["bagging_temperature"] = trial.suggest_float("bagging_temperature", 0, 10)
        elif param["bootstrap_type"] == "Bernoulli":
            param["subsample"] = trial.suggest_float("subsample", 0.1, 1)
    
        gbm = catboost.CatBoostClassifier(**param, iterations = 10000)
    
        gbm.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0, early_stopping_rounds = 100)
    
        preds = gbm.predict(X_val)
        pred_labels = np.rint(preds)
        accuracy = accuracy_score(y_val, pred_labels)
    
        return accuracy
    
    study = optuna.create_study(direction = "maximize")
    study.optimize(objective, n_trials = 200, show_progress_bar = True)
    

    This method takes a lot of time (1-2 hours, maybe). It is best when you have many parameters to tune; otherwise, use GridSearchCV.
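
    Since the question is about regression, the classifier objective needs two changes: return a regression metric such as validation MAE, and create the study with direction = "minimize" so lower is better. A dependency-free sketch of that idea, with plain random sampling standing in for Optuna's sampler and scikit-learn's GradientBoostingRegressor standing in for CatBoost (both are hedged substitutes, not the original setup):

    ```python
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Synthetic stand-in data; substitute your own X and y
    X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

    def objective(params):
        # Same shape as the Optuna objective above: split, fit, return validation MAE
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
        model = GradientBoostingRegressor(**params, random_state=0)
        model.fit(X_train, y_train)
        return mean_absolute_error(y_val, model.predict(X_val))

    best_mae, best_params = float("inf"), None
    for _ in range(10):  # Optuna runs n_trials of these, choosing samples adaptively
        params = {
            "learning_rate": float(10 ** rng.uniform(-3, -0.5)),  # log-uniform draw
            "max_depth": int(rng.integers(1, 6)),
        }
        mae = objective(params)
        if mae < best_mae:           # direction = "minimize"
            best_mae, best_params = mae, params

    print(best_mae, best_params)
    ```

    With Optuna itself, the equivalent change is simply `return mae` from the objective and `optuna.create_study(direction = "minimize")`.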