python machine-learning scikit-learn cross-validation

How to use RandomizedSearchCV or GridSearchCV for only 30% of data

How to use RandomizedSearchCV or GridSearchCV for only 30% of data in order to speed up the process. My X.shape is 94456,100 and I'm trying to use RandomizedSearchCV or GridSearchCV but it's taking a verly long time. I'm runnig my code for several hours but still with no results. My code looks like this:

# Random Forest

param_grid = [
{'n_estimators': np.arange(2, 25), 'max_features': [2,5,10,25], 
 'max_depth': np.arange(10, 50), 'bootstrap': [True, False]}
]

clf = RandomForestClassifier()

grid_search_forest = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search_forest.fit(X, y)

rf_best_model = grid_search_forest.best_estimator_


# Decsision Tree

param_grid = {'max_depth': np.arange(1, 50), 'min_samples_split': [20, 30, 40]}

grid_search_dec_tree = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, scoring='accuracy')
grid_search_dec_tree.fit(X, y)

dt_best_model = grid_search_dec_tree.best_estimator_


# K Nearest Neighbor

knn = KNeighborsClassifier()
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid_search_knn = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')

grid_search_knn.fit(X, y)

knn_best_model = grid_search_knn.best_estimator_

Solution

You can always sample a part of your data to fit your models. Although not designed for this purpose, train_test_split can be useful here (it can take care of shuffling, stratification etc, which in a manual sampling you would have to take care of by yourself):

from sklearn.model_selection import train_test_split
X_train, _, y_train, _ = train_test_split(X, y, stratify=y, test_size=0.70)

By asking for test_size=0.70, your training data X_train will now be 30% of your initial set X.

You should now replace all the .fit(X, y) statements in your code with .fit(X_train, y_train).

On a more general level, all these np.arange() statements in your grid look like overkill - I would suggest selecting some representative values in a list instead of going through a grid search in that detail. Random Forests in particular are notoriously insensitive in the number of trees n_estimators, and adding one tree at a time is hardly useful - go for something like 'n_estimators': [50, 100]...