Tags: scikit-learn, svm, gridsearchcv, one-class-classification

OneClassSVM performance not repeatable. Why?


I train a OneClassSVM for anomaly detection, using GridSearchCV for hyperparameter tuning.

What I do is a single-fold validation: for each HP configuration, I train on my class of interest only and validate on a mix of my class of interest and other classes. I set "refit=False" in GridSearchCV because I don't want it to retrain on everything (all the observations of my class of interest plus the rest).

The HP tuning gives me a best metric.

After that, for the sake of verification, I train a OneClassSVM without GridSearchCV, using a simple model.fit() on the same train set I passed to GridSearchCV, and evaluate it on the same validation set. This gives me a slightly different metric.

So my question is: is there some randomness in OneClassSVM? I saw in old versions of the scikit-learn docs that this model had a "random_state" parameter, which is no longer available. I thought that this parameter, coupled with "max_iter=-1", could be the cause of this non-repeatability.

I triple-checked my code for fold creation, etc., so I don't think the mistake is there.

Below is an example of my code:

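# Imports needed by the snippet below
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Instantiation of a PCA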
pca = PCA()

# Instantiation of the StandardScaler
scaler = StandardScaler()

# Numerical variables
numeric_features = X.select_dtypes([np.number]).columns

# Instantiation of the preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("scaling", scaler, numeric_features)
    ]
)

# Creation of the pipeline
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("pca", pca),
    ("estimator", OneClassSVM())
])

# Definition of the models and hyperparameters configurations to try
parameters = [
    {
        "pca__n_components": [3, 5, 7],
        "estimator__kernel": ["linear", "poly", "sigmoid"],
        "estimator__degree": [2, 3, 4, 5],
        "estimator__gamma": ["scale", "auto"],
        "estimator__nu": [0.01, 0.05, 0.1],  
        "estimator__max_iter": [-1]
    }
]   
                       
# Hyper-parameters optimization
grid_search = GridSearchCV(pipeline, parameters, cv=folds_indices, scoring="f1_weighted", n_jobs=1, refit=False, return_train_score=False, verbose=3)
grid_search.fit(X, y)
print("\nBest score is:")
print(f"{grid_search.best_score_:.4f}")
print("\n")
print("Obtained with hyperparameters:")
print(grid_search.best_params_)
print("\n")


# Instantiation of a OneClassSVM model with the best parameters found
model = OneClassSVM(
    degree=grid_search.best_params_["estimator__degree"],
    gamma=grid_search.best_params_["estimator__gamma"],
    kernel=grid_search.best_params_["estimator__kernel"],
    nu=grid_search.best_params_["estimator__nu"],
    max_iter=grid_search.best_params_["estimator__max_iter"],
)

X contains my features and y the observation labels. X holds both the train and validation observations, which are selected in the grid search via "cv=folds_indices", a tuple. I don't set "cv" to an integer because this is a one-class model: doing so would also train my model on the validation set, which contains a mix of classes.

I also set "refit" to False because I don't want to train on all the train+validation data at the end.

Finally, I create a model from scratch with the best HP configuration found by the grid search. I then train this model on the same data used for training in the grid search and evaluate it with ".predict()" on the same validation set. Doing so gives me different results on each run: I ran fit() + predict() several times and got slightly different results each time, with the HP and data sets unchanged.

When checking "model.n_iter_", I see that this number changes between runs. I thought that maybe OneClassSVM shuffles the data and/or processes it iteratively in batches, producing different conditions each time. Setting "max_iter" to a fixed number doesn't fix my problem (the metrics still change).
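For readers who want to reproduce the comparison described above, here is a minimal sketch. It is not the original code: the synthetic data, the names X_all, y_all, folds, pipe, grid and search, and the reduced grid are all assumptions made for illustration. It rebuilds the whole pipeline (scaler, PCA and OneClassSVM) from best_params_, on the assumption that the manual check should go through the same preprocessing as the grid search, and it fixes the PCA seed so that both paths are deterministic.

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Hypothetical data: 200 inliers for training, then 50 inliers and
# 50 shifted outliers for validation.
X_train = rng.normal(size=(200, 10))
X_val = np.vstack([rng.normal(size=(50, 10)),
                   rng.normal(loc=4.0, size=(50, 10))])
X_all = np.vstack([X_train, X_val])
# OneClassSVM.predict returns +1 for inliers and -1 for outliers,
# so the labels use the same encoding.
y_all = np.concatenate([np.ones(200), np.ones(50), -np.ones(50)]).astype(int)

# A single predefined (train, validation) split, passed as a list of pairs.
train_idx = np.arange(200)
val_idx = np.arange(200, 300)
folds = [(train_idx, val_idx)]

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=0)),  # fixed seed for any stochastic solver
    ("estimator", OneClassSVM()),
])
grid = {"pca__n_components": [3, 5], "estimator__nu": [0.05, 0.1]}

search = GridSearchCV(pipe, grid, cv=folds, scoring="f1_weighted", refit=False)
search.fit(X_all, y_all)

# Rebuild the full pipeline with the best parameters, train on the
# training indices only, and score on the validation indices.
best = clone(pipe).set_params(**search.best_params_)
best.fit(X_all[train_idx])
manual_score = f1_score(y_all[val_idx], best.predict(X_all[val_idx]),
                        average="weighted")
print(search.best_score_, manual_score)
```

Using clone(...).set_params(**best_params_) keeps every tuned step, including pca__n_components, in the manual check, so the two scores are computed by the same chain of transformations.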

Many thanks in advance

Cheers

Antoine


Solution

  • Depending on the size of your dataset, PCA() may switch to a randomized SVD solver (its default is svd_solver="auto"). If this happens, you'll see some variability in the PCA results each time you run it, and that variability carries through to the OneClassSVM trained on its output. To eliminate it, set the random_state= parameter in PCA(). Alternatively, force PCA() to stick with a non-stochastic algorithm through its svd_solver= parameter, for example svd_solver="full".
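The two options can be sketched as follows. This is only an illustration on synthetic data: the array X_demo and the choice of 5 components are assumptions, and the randomized solver is requested explicitly here so the example does not depend on which solver "auto" would select for a given scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data standing in for the real feature matrix (assumption).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(600, 600))

# Option 1: keep the randomized solver but fix its seed.
pca_a = PCA(n_components=5, svd_solver="randomized", random_state=42)
pca_b = PCA(n_components=5, svd_solver="randomized", random_state=42)
Z_a = pca_a.fit_transform(X_demo)
Z_b = pca_b.fit_transform(X_demo)
same_seeded = np.allclose(Z_a, Z_b)

# Option 2: force the exact, non-stochastic solver.
Z_c = PCA(n_components=5, svd_solver="full").fit_transform(X_demo)
Z_d = PCA(n_components=5, svd_solver="full").fit_transform(X_demo)
same_full = np.allclose(Z_c, Z_d)

print(same_seeded, same_full)
```

In the pipeline from the question, this would amount to replacing PCA() with something like PCA(random_state=0) or PCA(svd_solver="full") before running the grid search and the manual check.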