
Hyperparameter tuning in sklearn using RandomizedSearchCV is taking a lot of time

I am working with a dataset of 13 features and 550,068 rows. I ran 10-fold cross-validation to compare models and selected the one with the lowest root mean squared error, which in my case was a gradient boosting regressor. I then started hyperparameter tuning. Here is my code:

from sklearn.ensemble import GradientBoostingRegressor
gradientboost = GradientBoostingRegressor(n_estimators = 300)
from sklearn.model_selection import RandomizedSearchCV
loss = ['ls', 'lad', 'huber']
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
min_samples_leaf = [1, 2, 4, 6, 8] 
min_samples_split = [2, 4, 6, 10]
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
    'n_estimators': n_estimators,
    'max_depth': max_depth,
    'min_samples_leaf': min_samples_leaf,
    'min_samples_split': min_samples_split,
    'max_features': max_features}

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=gradientboost,
            param_distributions=hyperparameter_grid,
            cv=4, n_iter=50,
            scoring='neg_mean_absolute_error', n_jobs=4,
            verbose=5,
            return_train_score=True)

The tuning is taking a very long time: it has been running for almost 48 hours and has still not completed. I tried different values for n_jobs, n_iter and cv, but the process did not speed up. I also divided my dataset into 5 equal parts and tried tuning on a single part:

dataframe_splits = np.array_split(dataframe, 5)
features = dataframe_splits[0].drop(columns=['Purchase', 'User_ID', 'Product_ID'])
target = dataframe_splits[0]['Purchase']

But that did not help: tuning on a single part also takes a very long time. I am using Windows 10 with a 7th-generation Intel i5 processor. Can anyone help me figure out this problem? Thanks in advance.


  • It's a combination of a couple of things:

    • having half a million samples,
    • using gradient boosting with large ensembles (lots of trees),
    • having a big search grid in general,
    • doing 10-fold cross-validation.
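To put rough numbers on these factors, here is a small back-of-the-envelope sketch. It uses the grid, n_iter=50 and cv=4 exactly as they appear in the question's code; the "expected trees" figure assumes the values of n_estimators are sampled uniformly, and it ignores the final refit on the full data.

```python
from math import prod

# Candidate values copied from the question's grid.
grid = {
    "loss": ["ls", "lad", "huber"],
    "n_estimators": [100, 500, 900, 1100, 1500],
    "max_depth": [2, 3, 5, 10, 15],
    "min_samples_leaf": [1, 2, 4, 6, 8],
    "min_samples_split": [2, 4, 6, 10],
    "max_features": ["auto", "sqrt", "log2", None],
}
n_iter, cv = 50, 4  # settings from the question's RandomizedSearchCV call

# Size of the full grid that the random search samples from.
total_combinations = prod(len(values) for values in grid.values())

# One model is fitted per sampled candidate per fold (final refit not counted).
total_fits = n_iter * cv

# Average ensemble size, assuming n_estimators is sampled uniformly.
mean_trees = sum(grid["n_estimators"]) / len(grid["n_estimators"])
expected_trees = total_fits * mean_trees

print(total_combinations)  # 6000
print(total_fits)          # 200
print(expected_trees)      # 164000.0
```

So even with only 4 folds, the search builds on the order of 160,000 trees, each on roughly three quarters of the 550,068 rows, and some of them up to 15 levels deep.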

    Training something like this on a local machine will not get you far. If you are not training a production-grade model (but rather something like a side or university project), try these things:

    • make your sample much smaller (say 10k samples),
    • try to get a good understanding of what each hyperparameter does and how gradient boosting works. For example, the loss function in your grid is not going to make a huge difference, whereas you are missing a very important parameter, 'learning_rate'. The same goes for 'max_features': 'auto' and None do essentially the same thing, and it is a good idea to experiment with some floats there (a float is interpreted as a fraction of the features).
    • tune fewer parameters. Currently you are sampling 50 out of 3 * 5 * 5 * 5 * 4 * 4 = 6000 possible combinations. You could start with a smaller grid (say 100 to 200 possible combinations) and sample fewer combinations, see which parameters make the biggest difference, and then fine-tune those a couple at a time rather than all at once. The most expensive parameter is 'n_estimators', since the model is an ensemble of 'n_estimators' trees built one after another, so training time grows with it. A good start would be to first find an approximate number of estimators that sits on the edge of the bias/variance trade-off and then put that value into the grid.
    • decrease the number of folds k to 8 or even 5. Every candidate is fitted once per fold, so the running time drops roughly in proportion.
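The suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (make_regression stands in for the real features and target), the sizes are kept tiny so the snippet runs quickly, and the candidate values for learning_rate, max_depth and max_features are hypothetical starting points, not recommendations. Early stopping via n_iter_no_change is one possible way to get an approximate n_estimators; it is not necessarily what the answer had in mind.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real features/target (assumption: numeric features).
X, y = make_regression(n_samples=3000, n_features=10, noise=10.0, random_state=0)

# 1) Draw a random subsample rather than taking the first contiguous chunk.
rng = np.random.default_rng(0)
sample_size = 600  # kept small here; something like 10_000 on the real data
idx = rng.choice(len(X), size=sample_size, replace=False)
X_small, y_small = X[idx], y[idx]

# 2) Rough n_estimators: stop adding trees once the validation loss stalls.
probe = GradientBoostingRegressor(
    n_estimators=150,         # upper bound on the number of trees
    learning_rate=0.1,
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,
    random_state=0,
)
probe.fit(X_small, y_small)
n_trees = probe.n_estimators_  # number of trees actually fitted

# 3) Small search over a few parameters, with n_estimators held fixed.
param_distributions = {
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 5],
    "max_features": [0.5, 0.8, 1.0],  # floats = fraction of the features
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=n_trees, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # 5 of the 27 combinations
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
    n_jobs=1,
)
search.fit(X_small, y_small)

print(n_trees, search.best_params_)
```

Once the most influential parameters are identified on the subsample, the search can be narrowed around them and the sample size increased step by step.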

    If you are doing this at production scale and want to use the whole dataset, you will need more powerful computational resources, such as a virtual machine, and/or a different package for training gradient boosted trees, such as xgboost or LightGBM. Both of those should support GPU training, so if you have a CUDA GPU you can use it as well.