Tags: python, machine-learning, scikit-learn, cross-validation, hyperparameters

Hyperparameter tuning


I'm currently working on a project of my own. For this project I tried to compare the results of multiple algorithms, and I want to be sure that every algorithm tested is configured to give the best results.

So I use cross-validation to test every combination of parameters and choose the best one.

For example:

from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV, ShuffleSplit

def KMeanstest(param_grid, n_jobs):
    estimator = KMeans()

    # 10 random splits, each holding out 20% of the data
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

    # exhaustively evaluate every parameter combination in param_grid
    regressor = GridSearchCV(estimator=estimator, cv=cv,
                             param_grid=param_grid, n_jobs=n_jobs)
    regressor.fit(X_train, y_train)

    print("Best Estimator learned through GridSearch")
    print(regressor.best_estimator_)

    return cv, regressor.best_estimator_

# note: this grid targets older scikit-learn releases; on recent versions
# precompute_distances was removed and the 'auto'/'full' algorithm options were renamed
param_grid={'n_clusters': [2],
            'init': ['k-means++', 'random'],
            'max_iter': [100, 200, 300, 400, 500],
            'n_init': [8, 9, 10, 11, 12, 13, 14, 15, 16],
            'tol': [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
            'precompute_distances': ['auto', True, False],
            'random_state': [42],
            'copy_x': [True, False],
            'n_jobs': [-1],
            'algorithm': ['auto', 'full', 'elkan']
           }

n_jobs=-1

cv,best_est=KMeanstest(param_grid, n_jobs)

But this is very time-consuming. I want to know if this method is the best, or if I need to use a different approach.

Thank you for your help.


Solution

The problem with GridSearch is that it is very time-consuming, as you rightly said. RandomSearch can be a good option sometimes, but it is not optimal.
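For reference, here is a minimal sketch of random search with scikit-learn's RandomizedSearchCV, reusing the estimator, CV scheme, and param_grid from the question; the n_iter=20 budget is an arbitrary choice for illustration, not something fixed by the original post.

from sklearn.cluster import KMeans
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# sample 20 parameter combinations at random instead of trying all of them
search = RandomizedSearchCV(estimator=KMeans(), param_distributions=param_grid,
                            n_iter=20, cv=cv, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)

print(search.best_estimator_)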

Bayesian Optimization is another option. This lets you home in on the optimal parameter set quickly using a probabilistic approach. I have tried it personally using the hyperopt library in Python, and it works really well. Check out this tutorial for more information. You can also download the associated notebook from my GitHub.
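As a rough illustration (not the linked tutorial's exact code), here is a minimal hyperopt sketch using the TPE algorithm over a few of the KMeans parameters; using KMeans inertia on X_train as the loss is an assumption made for this example.

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.cluster import KMeans

def objective(params):
    # fit KMeans with the sampled parameters and minimize its inertia
    # (within-cluster sum of squares) on the training data -- an assumed loss
    model = KMeans(n_clusters=2, random_state=42, **params)
    model.fit(X_train)
    return {'loss': model.inertia_, 'status': STATUS_OK}

space = {
    'init': hp.choice('init', ['k-means++', 'random']),
    'max_iter': hp.choice('max_iter', [100, 200, 300, 400, 500]),
    'tol': hp.loguniform('tol', -14, 0),  # roughly 1e-6 to 1, sampled on a log scale
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)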

The good thing is that, since you have already experimented with GridSearch, you have a rough idea of which parameter ranges do not work well. So you can define a tighter search space for the Bayesian Optimization to run on, which will reduce the time even more. Also, hyperopt can be used to compare multiple algorithms and their respective parameters, as sketched below.
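To illustrate that last point, here is a hedged sketch in which a top-level hp.choice picks between two algorithms, each branch carrying its own hyperparameters; the estimators, the parameter ranges, and the silhouette-based loss are all assumptions made for the example, not part of the original answer.

from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# the top-level hp.choice picks an algorithm; each branch has its own parameters
space = hp.choice('model', [
    {
        'type': 'kmeans',
        'n_clusters': hp.choice('km_n_clusters', [2, 3, 4]),
        'init': hp.choice('km_init', ['k-means++', 'random']),
    },
    {
        'type': 'gmm',
        'n_components': hp.choice('gmm_n_components', [2, 3, 4]),
        'covariance_type': hp.choice('gmm_cov', ['full', 'diag']),
    },
])

def objective(params):
    if params['type'] == 'kmeans':
        model = KMeans(n_clusters=params['n_clusters'],
                       init=params['init'], random_state=42)
    else:
        model = GaussianMixture(n_components=params['n_components'],
                                covariance_type=params['covariance_type'],
                                random_state=42)
    labels = model.fit_predict(X_train)
    # minimize the negative silhouette so both algorithms share one metric
    return {'loss': -silhouette_score(X_train, labels), 'status': STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)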