python numpy machine-learning genetic-algorithm pygad

PyGAD is not receiving integer parameters according to documentation


I am trying to use PyGAD to optimize hyper-parameters in ML models. According to the documentation:

The gene_space parameter customizes the space of values of each gene ... list, tuple, numpy.ndarray, or any range like range, numpy.arange(), or numpy.linspace: It holds the space for each individual gene. But this space is usually discrete. That is there is a set of finite values to select from.

In the gene_space shown below, the first element, which corresponds to solution[0] in the fitness function, is an array of integers. According to the documentation, this should be a discrete space, which it is. However, when a value from this array of integers (built with np.linspace, which the documentation allows) is passed to RandomForestClassifier, it arrives as a numpy.float64 (see the error at the end of the third code block).

I don't understand where this change of data type is occurring. Is this a PyGAD problem, and if so, how can I fix it? Or is it a numpy -> sklearn problem?

gene_space = [ 
    # n_estimators
    np.linspace(50,200,25, dtype='int'),
    # min_samples_split, 
    np.linspace(2,10,5, dtype='int'),
    # min_samples_leaf,
    np.linspace(1,10,5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0,1,10, dtype='float')
]

The fitness function for the Genetic Algorithm:

def fitness_function_factory(data=data, y_name='y', sample_size=100):

    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )
        
        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)        

        train_idx = sample_without_replacement(n_population=len(X_train), 
                                              n_samples=sample_size)         
        
        test_idx = sample_without_replacement(n_population=len(X_test), 
                                              n_samples=sample_size) 
         
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        
        return fitness 

    return fitness_function

And the instantiation of the Genetic Algorithm:

cross_validate = pygad.GA(gene_space=gene_space,
                      fitness_func=fitness_function_factory(),
                      num_generations=100,
                      num_parents_mating=2,
                      sol_per_pop=8,
                      num_genes=len(gene_space),
                      parent_selection_type='sss',
                      keep_parents=2,
                      crossover_type="single_point",
                      mutation_type="random",
                      mutation_percent_genes=25)

cross_validate.best_solution()
>>>
ValueError: n_estimators must be an integer, got <class 'numpy.float64'>.

Any recommendations on resolving this error?

EDIT: I tried the following, which works:

model = RandomForestClassifier(n_estimators=gene_space[0][0])
model.fit(X,y)

So the issue does not lie with numpy->sklearn but with PyGAD.


Solution

  • I spotted two issues here:

    1. pygad.GA does not infer the numeric type of each gene from the values in "gene_space"; it simply converts all numeric values to float.
      To fix this, use the "gene_type" parameter to specify the type of each gene. https://pygad.readthedocs.io/en/latest/README_pygad_ReadTheDocs.html#more-about-the-gene-type-parameter

    2. numpy.linspace() doesn't work as documented for customizing the space of values of each gene. Using it leads to zeros for all genes when the population is initialized.
      So it's better to use either the notation {"low": 50, "high": 200, "step": 25} or to convert the numpy.ndarray to a list, as in numpy.linspace(...).tolist().
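The type change described in the first issue can be reproduced with numpy alone. The sketch below only imitates the situation (it assumes the solution ends up stored in a float array, as the error message suggests, and does not call PyGAD); the variable names are made up for illustration. It also shows a defensive cast that could be applied inside the fitness function.

```python
import numpy as np

# Integer-valued gene space, as defined in the question.
space = np.linspace(50, 200, 25, dtype=int)

# Assumption: without gene_type, the solution is held in a float array,
# so even integer-valued genes come back as numpy.float64.
solution = np.array([space[4], 2, 1, 0.5], dtype=float)
print(type(solution[0]))

# Defensive cast before handing the value to an estimator that
# insists on a true integer.
n_estimators = int(solution[0])
print(n_estimators, type(n_estimators))
```

Casting inside the fitness function works regardless of how the genes are stored, but setting "gene_type" keeps the population itself consistent.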

    gene_space

    gene_space = [
        # n_estimators
        {"low": 50, "high": 200, "step": 25},
        # min_samples_split,
        {"low": 2, "high": 10, "step": 5},
        # min_samples_leaf,
        {"low": 1, "high": 10, "step": 5},
        # min_impurity_decrease
        np.linspace(0, 1, 10).tolist()
    ]
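One caveat about the dict notation: "step" is the spacing between values, whereas the third argument of numpy.linspace is the number of values. The sketch below uses numpy only and assumes PyGAD expands {"low", "high", "step"} the way numpy.arange does (low included, high excluded), so the dicts above describe coarser grids than the ones in the question.

```python
import numpy as np

# Grid from the question: 25 evenly spaced integers from 50 to 200.
original = np.linspace(50, 200, 25, dtype=int)

# Assumed equivalent of {"low": 50, "high": 200, "step": 25}:
# spacing of 25, upper bound excluded.
stepped = np.arange(50, 200, 25)

# Keeping the original grid, but as a plain list of Python ints.
same_grid = np.linspace(50, 200, 25, dtype=int).tolist()

print(len(original), len(stepped))
print(stepped.tolist())
```

If the original 25-point grid matters, passing the .tolist() version keeps it intact.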
    

    gene_type

    cross_validate = pygad.GA(
        gene_space=gene_space,
        fitness_func=fitness_function_factory(),
        num_generations=100,
        num_parents_mating=2,
        sol_per_pop=8,
        num_genes=len(gene_space),
        parent_selection_type='sss',
        keep_parents=2,
        crossover_type="single_point",
        mutation_type="random",
        mutation_percent_genes=25,
        gene_type=[int, int, int, float]
    )
    

    I tested it this way:

    import numpy as np
    import pandas as pd
    import pygad
    from numpy.random import default_rng
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.utils.random import sample_without_replacement
    
    gene_space = [
        # n_estimators
        {"low": 50, "high": 200, "step": 25},
        # min_samples_split,
        {"low": 2, "high": 10, "step": 5},
        # min_samples_leaf,
        {"low": 1, "high": 10, "step": 5},
        # min_impurity_decrease
        np.linspace(0, 1, 10).tolist()
    ]
    
    rng = default_rng()
    n = 1000
    data = pd.DataFrame({"x_1": rng.standard_normal(n), "x_2": rng.standard_normal(n), "y": rng.integers(0, 2, n)})
    
    
    def fitness_function_factory(data=data, y_name='y', sample_size=100):
    
        def fitness_function(solution, solution_idx):
    
            model = RandomForestClassifier(
                n_estimators=solution[0],
                min_samples_split=solution[1],
                min_samples_leaf=solution[2],
                min_impurity_decrease=solution[3]
            )
    
            X = data.drop(columns=[y_name])
            y = data[y_name]
            X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                                test_size=0.5)
    
            train_idx = sample_without_replacement(n_population=len(X_train),
                                                   n_samples=sample_size)
    
            test_idx = sample_without_replacement(n_population=len(X_test),
                                                  n_samples=sample_size)
    
            model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
            fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
    
            return fitness
    
        return fitness_function
    
    
    cross_validate = pygad.GA(
        gene_space=gene_space,
        fitness_func=fitness_function_factory(),
        num_generations=100,
        num_parents_mating=2,
        sol_per_pop=8,
        num_genes=len(gene_space),
        parent_selection_type='sss',
        keep_parents=2,
        crossover_type="single_point",
        mutation_type="random",
        mutation_percent_genes=25,
        gene_type=[int, int, int, float]
    )
    
    print(cross_validate.best_solution())
    
    (array([75, 2, 1, 0.5555555555555556], dtype=object), 0.5, 3)
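The printed tuple holds the best solution, its fitness, and its index in the population. As a small sketch, the tuple is rebuilt by hand here (no GA is run) and unpacked into named hyper-parameters; the "params" dict is a hypothetical helper, not part of PyGAD.

```python
import numpy as np

# The tuple printed above, rebuilt by hand for illustration.
result = (np.array([75, 2, 1, 0.5555555555555556], dtype=object), 0.5, 3)

solution, solution_fitness, solution_idx = result

# Map each gene back to the hyper-parameter it encodes.
params = {
    "n_estimators": int(solution[0]),
    "min_samples_split": int(solution[1]),
    "min_samples_leaf": int(solution[2]),
    "min_impurity_decrease": float(solution[3]),
}
print(params, solution_fitness, solution_idx)
```

The dtype=object array is what allows integer and float genes to sit side by side in one solution.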