How can I use RandomizedSearchCV or GridSearchCV on only 30% of the data in order to speed up the process? My X.shape is (94456, 100) and I'm trying to use RandomizedSearchCV or GridSearchCV, but it's taking a very long time. I've been running my code for several hours but still have no results. My code looks like this:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Random Forest
param_grid = [
{'n_estimators': np.arange(2, 25), 'max_features': [2,5,10,25],
'max_depth': np.arange(10, 50), 'bootstrap': [True, False]}
]
clf = RandomForestClassifier()
grid_search_forest = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search_forest.fit(X, y)
rf_best_model = grid_search_forest.best_estimator_
# Decision Tree
param_grid = {'max_depth': np.arange(1, 50), 'min_samples_split': [20, 30, 40]}
grid_search_dec_tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy')
grid_search_dec_tree.fit(X, y)
dt_best_model = grid_search_dec_tree.best_estimator_
# K Nearest Neighbor
knn = KNeighborsClassifier()
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid_search_knn = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid_search_knn.fit(X, y)
knn_best_model = grid_search_knn.best_estimator_
You can always sample a part of your data to fit your models. Although not designed for this purpose, train_test_split can be useful here, since it takes care of shuffling, stratification, etc., which you would otherwise have to handle yourself when sampling manually:
from sklearn.model_selection import train_test_split
X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.70)
By asking for test_size=0.70, your training data X_train will now be 30% of your initial set X. You should now replace all the .fit(X, y) statements in your code with .fit(X_train, y_train).
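For example, a minimal sketch of how the fit calls from your code would change (this assumes the estimators and grids are defined exactly as in your question):

from sklearn.model_selection import train_test_split

# Stratified 30% sample used only for the hyperparameter search
X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.70)

# Random Forest
grid_search_forest.fit(X_train, y_train)
# Decision Tree
grid_search_dec_tree.fit(X_train, y_train)
# K Nearest Neighbor
grid_search_knn.fit(X_train, y_train)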
On a more general level, all these np.arange() statements in your grid look like overkill - I would suggest selecting a few representative values in a list instead of going through the grid search in that level of detail. Random Forests in particular are notoriously insensitive to the number of trees n_estimators, and adding one tree at a time is hardly useful - go for something like 'n_estimators': [50, 100] instead...
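As a rough sketch of what that could look like for the random forest (the specific values here are only illustrative, not recommendations, and X_train, y_train are the 30% sample from above), you could also switch to RandomizedSearchCV, which you already mention, with a small n_iter so only that many random combinations are tried:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Much coarser search space: a few representative values per parameter
param_distributions = {
    'n_estimators': [50, 100],
    'max_features': [2, 5, 10, 25],
    'max_depth': [10, 20, 30, 40, 50],
    'bootstrap': [True, False],
}

# Tries only n_iter random combinations instead of the full grid
random_search_forest = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=20,
    cv=5,
    scoring='accuracy',
)
random_search_forest.fit(X_train, y_train)
rf_best_model = random_search_forest.best_estimator_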