python numpy machine-learning genetic-algorithm pygad

PyGAD is not receiving integer parameters according to documentation

I am trying to use PyGAD to optimize hyper-parameters in ML models. According to the documentation:

The gene_space parameter customizes the space of values of each gene ... list, tuple, numpy.ndarray, or any range like range, numpy.arange(), or numpy.linspace: It holds the space for each individual gene. But this space is usually discrete. That is there is a set of finite values to select from.

In the gene_space below, the first element, which corresponds to solution[0] in the Genetic Algorithm definition, is an array of integers. According to the documentation, this should be a discrete space, which it is. However, when a value from this array of integers (built with np.linspace, which the documentation allows) reaches RandomForestClassifier, it arrives as a numpy.float64 (see the error in the third code block).

I don't understand where this change of data type is occurring. Is this a PyGAD problem, and if so, how can I fix it? Or is it a numpy -> sklearn problem?

gene_space = [ 
    # n_estimators
    np.linspace(50,200,25, dtype='int'),
    # min_samples_split, 
    np.linspace(2,10,5, dtype='int'),
    # min_samples_leaf,
    np.linspace(1,10,5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0,1,10, dtype='float')
]

The definition of the Genetic Algorithm

def fitness_function_factory(data=data, y_name='y', sample_size=100):

    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )
        
        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)        

        train_idx = sample_without_replacement(n_population=len(X_train), 
                                              n_samples=sample_size)         
        
        test_idx = sample_without_replacement(n_population=len(X_test), 
                                              n_samples=sample_size) 
         
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        
        return fitness 

    return fitness_function

And the instantiation of the Genetic Algorithm

cross_validate = pygad.GA(gene_space=gene_space,
                      fitness_func=fitness_function_factory(),
                      num_generations=100,
                      num_parents_mating=2,
                      sol_per_pop=8,
                      num_genes=len(gene_space),
                      parent_selection_type='sss',
                      keep_parents=2,
                      crossover_type="single_point",
                      mutation_type="random",
                      mutation_percent_genes=25)

cross_validate.best_solution()
>>>
ValueError: n_estimators must be an integer, got <class 'numpy.float64'>.

Any recommendations on resolving this error?

EDIT: I tried the following, and it works:

model = RandomForestClassifier(n_estimators=gene_space[0][0])
model.fit(X,y)

So the issue does not lie with numpy->sklearn but with PyGAD.
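
A quick way to see why one value is accepted and the other is rejected. This is only a minimal sketch: it assumes scikit-learn's check amounts to an isinstance test against numbers.Integral, and it uses a round trip through a float array as a stand-in for whatever PyGAD does to the genes internally.

```python
import numbers

import numpy as np

space = np.linspace(50, 200, 25, dtype='int')

# Element taken straight from the integer array: a NumPy integer scalar.
direct = space[0]

# The same value after passing through a float array (assumed here to
# mimic what happens to a gene once the GA stores it as a float).
round_trip = np.asarray(space, dtype=float)[0]

# NumPy integer scalars count as numbers.Integral; float64 does not.
print(type(direct).__name__, isinstance(direct, numbers.Integral))
print(type(round_trip).__name__, isinstance(round_trip, numbers.Integral))
```

The value is the same in both cases; only the type differs, which is consistent with the ValueError above.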

Solution

I've spotted two issues here:

1. pygad.GA does not derive the numeric type from the values in gene_space; it simply converts all gene values to float. To fix this, use the gene_type parameter to specify the respective type of each gene: https://pygad.readthedocs.io/en/latest/README_pygad_ReadTheDocs.html#more-about-the-gene-type-parameter
2. numpy.linspace() doesn't work as documented for customizing the space of values of each gene: it leads to zeros being produced for all genes when the population is created. So it's better to use either the dictionary notation, e.g. {"low": 50, "high": 200, "step": 25}, or convert the numpy.ndarray to a list with numpy.linspace(...).tolist(). Note that "step" is a step size, whereas the third argument of numpy.linspace() is the number of samples, so the dictionaries used below define a coarser grid than the original gene_space.
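
To make the difference between a sample count and a step size concrete, here is a small NumPy-only sketch. It assumes that the {"low", "high", "step"} form behaves like numpy.arange (start included, stop excluded), which is my reading of the PyGAD docs rather than something verified here.

```python
import numpy as np

# The third argument of linspace is the NUMBER of samples, not a step.
by_count = np.linspace(50, 200, 25, dtype='int')

# A step of 25 gives a much coarser grid, and arange excludes the stop value.
by_step = np.arange(50, 200, 25)

print(len(by_count), by_count[:4], by_count[-1])
print(len(by_step), by_step)

# tolist() turns NumPy scalars into plain Python ints.
as_list = by_count.tolist()
print(type(as_list[0]))
```

If the original 25-point grid matters, passing np.linspace(50, 200, 25, dtype='int').tolist() keeps it, while the step-based dictionary is the more compact option when a coarse grid is enough.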

gene_space

gene_space = [
    # n_estimators
    {"low": 50, "high": 200, "step": 25},
    # min_samples_split,
    {"low": 2, "high": 10, "step": 5},
    # min_samples_leaf,
    {"low": 1, "high": 10, "step": 5},
    # min_impurity_decrease
    np.linspace(0, 1, 10).tolist()
]

gene_type

cross_validate = pygad.GA(
    gene_space=gene_space,
    fitness_func=fitness_function_factory(),
    num_generations=100,
    num_parents_mating=2,
    sol_per_pop=8,
    num_genes=len(gene_space),
    parent_selection_type='sss',
    keep_parents=2,
    crossover_type="single_point",
    mutation_type="random",
    mutation_percent_genes=25,
    gene_type=[int, int, int, float]
)

I tested it this way:

import numpy as np
import pandas as pd
import pygad
from numpy.random import default_rng
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement

gene_space = [
    # n_estimators
    {"low": 50, "high": 200, "step": 25},
    # min_samples_split,
    {"low": 2, "high": 10, "step": 5},
    # min_samples_leaf,
    {"low": 1, "high": 10, "step": 5},
    # min_impurity_decrease
    np.linspace(0, 1, 10).tolist()
]

rng = default_rng()
n = 1000
data = pd.DataFrame({
    "x_1": rng.standard_normal(n),
    "x_2": rng.standard_normal(n),
    "y": rng.integers(0, 2, n),
})


def fitness_function_factory(data=data, y_name='y', sample_size=100):

    def fitness_function(solution, solution_idx):

        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )

        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)

        train_idx = sample_without_replacement(n_population=len(X_train),
                                               n_samples=sample_size)

        test_idx = sample_without_replacement(n_population=len(X_test),
                                              n_samples=sample_size)

        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])

        return fitness

    return fitness_function


cross_validate = pygad.GA(
    gene_space=gene_space,
    fitness_func=fitness_function_factory(),
    num_generations=100,
    num_parents_mating=2,
    sol_per_pop=8,
    num_genes=len(gene_space),
    parent_selection_type='sss',
    keep_parents=2,
    crossover_type="single_point",
    mutation_type="random",
    mutation_percent_genes=25,
    gene_type=[int, int, int, float]
)

print(cross_validate.best_solution())

(array([75, 2, 1, 0.5555555555555556], dtype=object), 0.5, 3)
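
As an extra safety net, the fitness function can also cast the genes itself, so it no longer depends on how the GA stores them. The helper below is hypothetical (decode_solution is not part of PyGAD or scikit-learn) and is only a sketch for the four genes used in this question.

```python
def decode_solution(solution):
    """Hypothetical helper: map a 4-gene solution to RandomForest kwargs."""
    return {
        # Integer hyper-parameters: round first, then cast to a Python int.
        "n_estimators": int(round(solution[0])),
        "min_samples_split": int(round(solution[1])),
        "min_samples_leaf": int(round(solution[2])),
        # Continuous hyper-parameter: keep it as a plain float.
        "min_impurity_decrease": float(solution[3]),
    }


# Works even when every gene arrives as a float.
params = decode_solution([75.0, 2.0, 1.0, 0.5555555555555556])
print(params)
```

Inside fitness_function this would be used as RandomForestClassifier(**decode_solution(solution)).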