
Unbalanced data set - how to optimize hyperparams via grid search?


I would like to optimize the hyperparameters C and gamma of an SVC via grid search on an unbalanced data set. So far I have used class_weight='balanced' and selected the best hyperparameters based on the average of the F1 scores. However, the data set is very unbalanced: if I use GridSearchCV with cv=10, some minority classes are not represented in the validation folds. I'm thinking of using SMOTE, but I see the problem that I would have to set k_neighbors=1 because some minority classes contain only 1-2 samples. Does anyone have a tip on how to optimize the hyperparameters in this case? Are there any alternatives?

Many thanks for any hints


Solution

  • I would like to optimize the hyperparameters C and gamma of an SVC by using grid search for an unbalanced data set. Does anyone have a tip on how to optimize the hyperparameters in this case?

    You could use the GridSearchCV() function, doing something like:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC


    # Candidate values for C and gamma
    param_grid = {'C': [0.1, 5, 50, 100],
                  'gamma': [1, 0.5, 0.1, 0.01]}

    # class_weight='balanced' keeps the per-class weighting you already use;
    # refit=True retrains the best model on the whole training set
    model = GridSearchCV(SVC(class_weight='balanced'), param_grid, refit=True)

    model.fit(X_train, y_train)
    print(model.best_params_)


    You could use RandomizedSearchCV in order to explore more options.
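
    A minimal sketch of that, assuming scipy >= 1.4 for loguniform (the distributions and n_iter below are illustrative assumptions, not tuned values):

    from scipy.stats import loguniform
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    # Sample C and gamma from log-uniform distributions instead of a fixed grid;
    # the ranges here are illustrative
    param_distributions = {'C': loguniform(1e-2, 1e3),
                           'gamma': loguniform(1e-3, 1e1)}

    search = RandomizedSearchCV(SVC(class_weight='balanced'),
                                param_distributions,
                                n_iter=50,  # number of sampled parameter settings
                                scoring='f1_macro',
                                random_state=42)
    search.fit(X_train, y_train)
    print(search.best_params_)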

    I'm thinking of using SMOTE, but I see the problem here that I would have to set k_neighbors=1

    Did you try ADASYN?
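
    A minimal sketch of how that could look, assuming the imbalanced-learn package is installed (note that ADASYN also has an n_neighbors parameter, so the same small-class limitation applies; n_neighbors=1 and n_splits=2 below are illustrative):

    from imblearn.over_sampling import ADASYN
    from imblearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    # Putting the oversampler inside an imblearn Pipeline ensures resampling
    # happens only on the training folds, never on the validation folds
    pipeline = Pipeline([
        ('oversample', ADASYN(n_neighbors=1, random_state=42)),
        ('svc', SVC(class_weight='balanced')),
    ])

    param_grid = {'svc__C': [0.1, 5, 50, 100],
                  'svc__gamma': [1, 0.5, 0.1, 0.01]}

    # Keep n_splits no larger than the smallest class size so every fold
    # contains at least one sample of each class
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

    search = GridSearchCV(pipeline, param_grid, scoring='f1_macro', cv=cv)
    search.fit(X_train, y_train)
    print(search.best_params_)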

    Are there any alternatives?

    When I am really lost, I try a "last resort": a tool called TPOT.

    A minimal example looks like this:

    from tpot import TPOTClassifier

    # scoring='roc_auc' assumes a binary target; for a multiclass problem,
    # a scorer such as 'f1_macro' may be a better fit
    tpot = TPOTClassifier(generations=5, population_size=50, scoring='roc_auc',
                          verbosity=2, random_state=42)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))
    tpot.export('tpot_results.py')
    

    It will output scikit-learn code with an algorithm and a pipeline; in this case, tpot_results.py would be:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer
    from tpot.export_utils import set_param_recursive

    # NOTE: Make sure that the outcome column is labeled 'target' in the data file
    tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
    features = tpot_data.drop('target', axis=1)
    training_features, testing_features, training_target, testing_target = \
                train_test_split(features, tpot_data['target'], random_state=42)

    # Average CV score on the training set was: 0.9826086956521738
    exported_pipeline = make_pipeline(
        Normalizer(norm="l2"),
        KNeighborsClassifier(n_neighbors=5, p=2, weights="distance")
    )
    # Fix random state for all the steps in exported pipeline
    set_param_recursive(exported_pipeline.steps, 'random_state', 42)

    exported_pipeline.fit(training_features, training_target)
    results = exported_pipeline.predict(testing_features)
    

    Be careful with overfitting when using this tool, but it is one alternative that I can recommend.