I am dealing with a dataset consisting of 13 features and 550,068 rows. I ran k-fold cross validation with k = 10 and selected the model with the lowest root mean squared error, which in my case was the gradient boosting regressor.
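Roughly, that selection step looked like the sketch below (the candidate list is illustrative, not my exact set of models, and features/target are built the same way as further down):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Candidate models (illustrative list)
models = {'linear_regression': LinearRegression(),
          'random_forest': RandomForestRegressor(n_estimators=100),
          'gradient_boosting': GradientBoostingRegressor(n_estimators=100)}

for name, model in models.items():
    # scikit-learn reports negated MSE, so flip the sign before taking the root
    mse = -cross_val_score(model, features, target, cv=10,
                           scoring='neg_mean_squared_error')
    print(name, 'mean RMSE:', np.sqrt(mse).mean())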
Then I did hyperparameter tuning; here is my code:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

gradientboost = GradientBoostingRegressor(n_estimators=300)

# Candidate values for each hyperparameter
loss = ['ls', 'lad', 'huber']
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
min_samples_leaf = [1, 2, 4, 6, 8]
min_samples_split = [2, 4, 6, 10]
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=gradientboost,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=50,
                               scoring='neg_mean_absolute_error',
                               n_jobs=4,
                               verbose=5,
                               return_train_score=True,
                               random_state=42)
random_cv.fit(features, target)
It's taking a lot of time for hyperparameter tuning; it has already run for almost 48 hours and has not yet completed. I tried different n_jobs, n_iter, and cv values, but the process is not speeding up. I also divided my dataset into 5 equal parts and tried parameter tuning on a single part:
import numpy as np

# Split the dataframe into 5 equal parts and keep only the first one
dataframe_splits = np.array_split(dataframe, 5)
features = dataframe_splits[0].drop(columns=['Purchase', 'User_ID', 'Product_ID'])
target = dataframe_splits[0]['Purchase']
But it is not working; it also takes a lot of time on a single part. I am using Windows 10 and an Intel i5 7th-generation processor. Can anyone help me figure out this problem? Thanks in advance.
It's a combination of a couple of things:

Training such a model on a local machine will not get you far. With cv=4 and n_iter=50, RandomizedSearchCV performs 50 × 4 = 200 full fits, each on roughly 412,000 rows, and a single candidate with n_estimators=1500 and max_depth=15 can take hours on a laptop CPU by itself. If you are not training a production-grade model (but more like a side or university project), try these things:

- Tune on a random subsample of the data and retrain only the winning configuration on the full dataset, as in the sketch after this list.
- Shrink the search space: drop the extreme values such as n_estimators=1500 and max_depth=15, and lower n_iter.
- Switch to a histogram-based gradient boosting implementation, which bins the features and trains far faster at this scale.
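A minimal sketch of the subsample-and-tune idea (the 10% fraction, the reduced grid, and HistGradientBoostingRegressor are illustrative choices, not taken from your code; on scikit-learn < 1.0 the estimator is experimental and needs an extra import):

from sklearn.model_selection import RandomizedSearchCV
# On scikit-learn < 1.0 this import is required first:
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingRegressor

# Tune on a 10% random subsample of the full dataframe
sample = dataframe.sample(frac=0.1, random_state=42)
features_small = sample.drop(columns=['Purchase', 'User_ID', 'Product_ID'])
target_small = sample['Purchase']

# A deliberately smaller search space than the original grid
param_grid = {'max_iter': [100, 300, 500],  # boosting iterations
              'max_depth': [3, 5, 10],
              'min_samples_leaf': [20, 50, 100],
              'learning_rate': [0.05, 0.1, 0.2]}

random_cv = RandomizedSearchCV(HistGradientBoostingRegressor(),
                               param_distributions=param_grid,
                               n_iter=20, cv=3,
                               scoring='neg_mean_absolute_error',
                               n_jobs=-1, random_state=42)
random_cv.fit(features_small, target_small)
print(random_cv.best_params_, random_cv.best_score_)

Once the search finishes, refit random_cv.best_params_ once on the full dataset instead of hundreds of times inside the search.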
If you are doing it at production scale and want to use the whole dataset, you will need some extra-powerful computational resources, such as a virtual machine, and/or a different package for training gradient-boosted trees, such as xgboost or LightGBM. Both support GPU training, so if you have a CUDA-capable GPU you can use that as well; a sketch follows.
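As an illustration, the same random search with LightGBM (the parameter values are placeholders, and device='gpu' only works with a GPU-enabled LightGBM build):

from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV

# LightGBM grows trees leaf-wise, so num_leaves is the main complexity knob
param_grid = {'n_estimators': [100, 500, 1000],
              'num_leaves': [31, 63, 127],
              'learning_rate': [0.05, 0.1, 0.2],
              'min_child_samples': [20, 50, 100]}

# device='gpu' needs a GPU build of LightGBM; drop it to train on CPU
lgbm = LGBMRegressor(device='gpu')

random_cv = RandomizedSearchCV(lgbm,
                               param_distributions=param_grid,
                               n_iter=20, cv=3,
                               scoring='neg_mean_absolute_error',
                               n_jobs=-1, random_state=42)
random_cv.fit(features, target)
print(random_cv.best_params_)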