I want to run the following RandomizedSearchCV:
import numpy as np
from scipy.stats import reciprocal, uniform
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

tree_reg = DecisionTreeRegressor()
param_grid = {
    "max_depth": np.arange(1, 12, 1),
    "min_samples_leaf": np.arange(2, 2259, 10),
    "min_samples_split": np.arange(2, 2259, 2),
    "max_leaf_nodes": np.arange(2, 2259, 2),
    "max_features": np.arange(2, len(features)),
}
rnd_search_tree = RandomizedSearchCV(tree_reg, param_grid, cv=cv_split, n_iter=10000,
                                     scoring=['neg_root_mean_squared_error', 'r2'],
                                     refit='neg_root_mean_squared_error',
                                     return_train_score=True, verbose=2)
rnd_search_tree.fit(dataset_prepared_stand, dataset_labels)
Where 2259 is the number of data points I have. However, I get the following error:
OverflowError Traceback (most recent call last)
<ipython-input-809-76074980f31c> in <module>
13 return_train_score=True, verbose=2)
14
---> 15 rnd_search_tree.fit(dataset_prepared_stand, dataset_labels)
~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
734 return results
735
--> 736 self._run_search(evaluate_candidates)
737
738 # For multi-metric evaluation, store the best_index_, best_params_ and
~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
1529 evaluate_candidates(ParameterSampler(
1530 self.param_distributions, self.n_iter,
-> 1531 random_state=self.random_state))
~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
698
699 def evaluate_candidates(candidate_params):
--> 700 candidate_params = list(candidate_params)
701 n_candidates = len(candidate_params)
702
~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in __iter__(self)
283 n_iter = grid_size
284 for i in sample_without_replacement(grid_size, n_iter,
--> 285 random_state=rng):
286 yield param_grid[i]
287
sklearn\utils\_random.pyx in sklearn.utils._random.sample_without_replacement()
OverflowError: Python int too large to convert to C long
The error does not occur if I take away even just one of the parameters to search over (or if I increase the step of the ranges to 1000, for example). Is there a way to solve it while still passing all the values I'd like to try?
I don't see an alternative to dropping RandomizedSearchCV. Internally, RandomizedSearchCV calls sample_without_replacement to draw candidates from your parameter grid. When every parameter is given as a list or array, it first computes the size of the full grid (the product of the lengths of all the lists), and when that number no longer fits in a C long, scikit-learn's sample_without_replacement simply breaks down.
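You can check the grid size yourself without fitting anything. A minimal sketch using the values from the question (ignoring max_features entirely, since the other four parameters already overflow a 32-bit C long, which is what a C long is on Windows):

import numpy as np

# Lengths of the four fully specified parameter lists from the question
lengths = [
    len(np.arange(1, 12, 1)),     # max_depth: 11 values
    len(np.arange(2, 2259, 10)),  # min_samples_leaf: 226 values
    len(np.arange(2, 2259, 2)),   # min_samples_split: 1129 values
    len(np.arange(2, 2259, 2)),   # max_leaf_nodes: 1129 values
]
grid_size = int(np.prod(lengths, dtype=object))
print(grid_size)              # 3168757526
print(grid_size > 2**31 - 1)  # True -> too big for a 32-bit C long

Dropping one parameter or coarsening a step pushes this product back under the limit, which is why the error disappears in those cases.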
Luckily, random search kind of sucks anyway. Check out optuna as an alternative. It is way smarter about where in your search space it spends its evaluation budget (paying more attention to high-performing regions), and it does not require you to discretize the space beforehand (that is, you can omit the step sizes). More generally, check out the field of AutoML.
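For a rough idea of what that looks like, here is a minimal sketch of an Optuna study for the same tree; X, y, cv_split and n_features are placeholders for your dataset_prepared_stand, dataset_labels, CV splitter and len(features), and the ranges are copied from the question:

import optuna
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def objective(trial):
    # Each suggest_int call samples an integer from the given (inclusive) range
    params = {
        "max_depth": trial.suggest_int("max_depth", 1, 11),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 2, 2258),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 2258),
        "max_leaf_nodes": trial.suggest_int("max_leaf_nodes", 2, 2258),
        "max_features": trial.suggest_int("max_features", 2, n_features - 1),
    }
    tree = DecisionTreeRegressor(**params)
    # Mean CV score; neg RMSE, so higher is better
    return cross_val_score(tree, X, y, cv=cv_split,
                           scoring="neg_root_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=200)
print(study.best_params, study.best_value)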
If you insist on random search, however, you'll have to find another implementation. Actually, optuna also supports a random sampler.
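To make the study above behave like a plain random search, swap in the random sampler and reuse the same objective:

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.RandomSampler(seed=42))
study.optimize(objective, n_trials=200)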