Tags: python, scipy, integer-overflow, grid-search

"OverflowError: Python int too large to convert to C long" when running a RandomizedSearchCV with scipy distributions


I want to run the following RandomizedSearch:

import numpy as np
from scipy.stats import reciprocal, uniform
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import RandomizedSearchCV

tree_reg = DecisionTreeRegressor()

param_grid = {
    "max_depth": np.arange(1, 12, 1),
    "min_samples_leaf": np.arange(2, 2259, 10),
    "min_samples_split": np.arange(2, 2259, 2),
    "max_leaf_nodes": np.arange(2, 2259, 2),
    "max_features": np.arange(2, len(features))
    }

rnd_search_tree = RandomizedSearchCV(tree_reg, param_grid, cv=cv_split, n_iter=10000,
                                     scoring=['neg_root_mean_squared_error', 'r2'],
                                     refit='neg_root_mean_squared_error',
                                     return_train_score=True, verbose=2)

rnd_search_tree.fit(dataset_prepared_stand, dataset_labels)

Where 2259 is the number of data points I have. However, I get the following error:

OverflowError                             Traceback (most recent call last)
<ipython-input-809-76074980f31c> in <module>
     13                                     return_train_score=True, verbose=2)
     14 
---> 15 rnd_search_tree.fit(dataset_prepared_stand, dataset_labels)

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    734                 return results
    735 
--> 736             self._run_search(evaluate_candidates)
    737 
    738         # For multi-metric evaluation, store the best_index_, best_params_ and

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1529         evaluate_candidates(ParameterSampler(
   1530             self.param_distributions, self.n_iter,
-> 1531             random_state=self.random_state))

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params)
    698 
    699             def evaluate_candidates(candidate_params):
--> 700                 candidate_params = list(candidate_params)
    701                 n_candidates = len(candidate_params)
    702 

~\anaconda3\envs\data_analysis\lib\site-packages\sklearn\model_selection\_search.py in __iter__(self)
    283                 n_iter = grid_size
    284             for i in sample_without_replacement(grid_size, n_iter,
--> 285                                                 random_state=rng):
    286                 yield param_grid[i]
    287 

sklearn\utils\_random.pyx in sklearn.utils._random.sample_without_replacement()

OverflowError: Python int too large to convert to C long

The error goes away if I remove even a single one of the parameters from the search (or if I change the step of the ranges to 1000, for example). Is there a way to solve this while still passing all the values I'd like to try?


Solution

  • I don't see an alternative to dropping RandomizedSearchCV. Internally, RandomizedSearchCV calls sample_without_replacement to draw candidates from your parameter grid. When the grid's total size (the product of the number of values per parameter) exceeds what fits in a C long, scikit-learn's sample_without_replacement simply breaks down.

    Luckily, random search kind of sucks anyway. Check out optuna as an alternative. It is much smarter about where in your search space to spend time evaluating (paying more attention to high-performing areas), and it does not require you to discretize your search space beforehand (that is, you can omit the step sizes). More generally, check out the field of AutoML.

    If you insist on random search, however, you'll have to find another implementation. Conveniently, optuna also supports a random sampler.
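    To see why the overflow happens, you can compute the grid size from the question yourself. Even with max_features omitted (len(features) isn't shown, and including it only makes the product larger), the product of the value counts already exceeds the 2**31 - 1 maximum of the 32-bit C long used on Windows builds, which matches the paths in the traceback:

    ```python
    import numpy as np

    # Number of candidate values per parameter, matching the grid in the question
    n_depth = len(np.arange(1, 12, 1))     # 11
    n_leaf = len(np.arange(2, 2259, 10))   # 226
    n_split = len(np.arange(2, 2259, 2))   # 1129
    n_nodes = len(np.arange(2, 2259, 2))   # 1129

    grid_size = n_depth * n_leaf * n_split * n_nodes
    print(grid_size)              # 3168757526
    print(grid_size > 2**31 - 1)  # True: does not fit in a 32-bit C long
    ```

    This also explains why dropping one parameter or coarsening a step makes the error disappear: the product drops back under the C long limit.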
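    Here is a minimal sketch of the optuna route for the tree above. It assumes the same parameter ranges as the question; since the asker's dataset and cv_split aren't available, make_regression and a plain cv=3 stand in for them:

    ```python
    import optuna
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    # Stand-in data; replace with your own X, y and CV splitter
    X, y = make_regression(n_samples=200, random_state=0)

    def objective(trial):
        # suggest_int samples from the full integer range -- no grid is enumerated
        params = {
            "max_depth": trial.suggest_int("max_depth", 1, 11),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 2, 2258),
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 2258),
            "max_leaf_nodes": trial.suggest_int("max_leaf_nodes", 2, 2258),
        }
        reg = DecisionTreeRegressor(**params, random_state=0)
        # Maximize negative RMSE, i.e. minimize RMSE
        return cross_val_score(reg, X, y, cv=3,
                               scoring="neg_root_mean_squared_error").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=10)
    print(study.best_params)
    ```

    Because each trial draws values independently instead of indexing into a pre-built grid, the size of the search space never has to fit in a C long.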