Tags: scikit-learn, scipy, random-seed, hyperparameters, scipy.stats

Is there a discretised version of scipy.stats.loguniform?


When running hyperparameter tuning on a random forest, I sometimes want to specify a large integer range of values for integer parameters like min_samples_leaf (e.g. ranging from the default value of 1 up to 100).

Whilst I could specify this range using scipy.stats.randint(1, 101) (the upper bound is exclusive), I'd prefer to use a log-uniform distribution, as my range covers two orders of magnitude. SciPy has stats.loguniform for continuous random variables, but doesn't seem to have a discretised equivalent.

A quick solution for approximating the discretised space is to just sample lots of continuous values and then convert the samples to integers:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

from scipy.stats import loguniform
import numpy as np

# Draw lots of samples and discretise them, in order to approximate
# a discretised loguniform sample space
def discretised_loguniform_samples(low, high, seed=None, sample_size=100_000):
    # The seed goes to rvs() via random_state=; the frozen distribution
    # itself only takes the low and high bounds
    float_rvs = loguniform(low, high).rvs(size=sample_size, random_state=seed)
    return float_rvs.round().astype(int)

Usage:

rf_param_distributions = {
    'min_samples_leaf': discretised_loguniform_samples(low=1, high=100, seed=0),
     ...
}

# This will draw n_iter=10 samples from the fixed list of integers created above.
# The list from which the samples are drawn is fixed in advance and therefore
# can't exploit the randomness imparted by the consumable random_state argument.
RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=np.random.RandomState(0)),
    param_distributions=rf_param_distributions,
    n_iter=10,
    random_state=np.random.RandomState(0),
    ...
)

The downside is that defining the list of integers in advance gives me only a single static sample space (however large) that is fixed throughout the tuning process. I want to exploit the randomness imparted by the consumable random_state= in RandomizedSearchCV*, rather than being limited to a pre-defined list.

How can I modify loguniform in such a way that I get a discretised version of its samples for each call to rvs()?

*RandomizedSearchCV passes down its random_state= parameter to the distribution's rvs() method. The docs seem ambiguous on this point, stating that random_state= is "used for sampling from lists of possible values instead of scipy.stats distributions".
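
To see this behaviour directly, here is a minimal sketch using sklearn.model_selection.ParameterSampler, which RandomizedSearchCV uses internally to draw its candidates. RecordingDistribution is a hypothetical stand-in that only records the random_state it is handed; it is not part of scikit-learn or SciPy.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler


class RecordingDistribution:
    """Hypothetical distribution-like object: it only needs an rvs() method."""

    def __init__(self):
        self.received = []

    def rvs(self, random_state=None):
        # Keep a reference to whatever the sampler passes in
        self.received.append(random_state)
        # Draw an integer in [1, 100] from the supplied generator
        return int(random_state.randint(1, 101))


dist = RecordingDistribution()
sampler = ParameterSampler({"min_samples_leaf": dist}, n_iter=5, random_state=0)
candidates = list(sampler)

# rvs() is called once per candidate, each time with the same generator object,
# so successive draws consume its state rather than restarting from the seed
print(len(dist.received))
print(all(rs is dist.received[0] for rs in dist.received))
print(candidates)
```

If the assumption holds for your scikit-learn version, every entry in dist.received should be the same np.random.RandomState instance.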


Solution

  • The approach below simply decorates/wraps loguniform.rvs() with a float-to-int function:

    from scipy.stats import loguniform

    def float_to_int(rvs):
        # Return a version of rvs() whose samples are rounded and cast to int
        def rvs_wrapper(*args, **kwargs):
            return rvs(*args, **kwargs).round().astype(int)
        return rvs_wrapper

    def int_loguniform(low, high):
        # Create a frozen loguniform object
        lu = loguniform(low, high)

        # Wrap its rvs() with float-to-int
        lu.rvs = float_to_int(lu.rvs)

        # Return the modified loguniform object
        return lu


    Usage:

    rf_param_distributions = {
        'min_samples_leaf': int_loguniform(low=1, high=100),
         ...
    }
    
    # Each iteration consumes the supplied random_state, so we are no longer
    # limited to drawing samples from a fixed list.
    RandomizedSearchCV(
        estimator=RandomForestRegressor(random_state=np.random.RandomState(0)),
        param_distributions=rf_param_distributions,
        n_iter=10,
        random_state=np.random.RandomState(0),
        ...
    )
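
An alternative that avoids reassigning rvs on the frozen object is a small wrapper class. This is a minimal sketch rather than a SciPy feature: IntLogUniform is a hypothetical name, and it relies only on RandomizedSearchCV accepting any object that provides an rvs method. Note that rounding gives the two endpoint values narrower bins than their neighbours (for example, only draws in [1, 1.5) map to 1), so the endpoints may be somewhat under-represented.

```python
import numpy as np
from scipy.stats import loguniform


class IntLogUniform:
    """Hypothetical helper: integer-valued draws from a frozen loguniform."""

    def __init__(self, low, high):
        self._dist = loguniform(low, high)

    def rvs(self, size=None, random_state=None):
        # Sample continuous values, then round to the nearest integer
        draws = self._dist.rvs(size=size, random_state=random_state)
        return np.rint(draws).astype(int)


dist = IntLogUniform(1, 100)

# The same integer seed reproduces the same draws
a = dist.rvs(size=1000, random_state=0)
b = dist.rvs(size=1000, random_state=0)

# A shared RandomState is consumed, so successive calls keep advancing it
rng = np.random.RandomState(0)
x = dist.rvs(size=50, random_state=rng)
y = dist.rvs(size=50, random_state=rng)
```

It can be used in place of int_loguniform(low=1, high=100) in the param_distributions dict above.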