
PCA on data and training with SVM with K-fold CV and Gridsearch


I need to train an SVM model using LinearSVC with 10-fold cross-validation and an internal 2-fold grid search to optimize gamma and C. I also have to apply PCA to my data to reduce its dimensionality. Should I apply PCA before the loop where the CV and training of the model happen, or within it? In the latter case I would get different principal components for each fold, but is there a disadvantage to that?


Solution

  • The best solution would be to create a sklearn Pipeline and put both steps (PCA and LinearSVC) inside it. This creates an object that implements fit() and predict() and can be used directly within a GridSearchCV. Because the PCA step is then re-fit on the training portion of every cross-validation split, the validation data never leaks into the dimensionality reduction.

    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline
    from sklearn.decomposition import PCA
    from sklearn.model_selection import GridSearchCV

    # Chain PCA and the classifier so PCA is re-fit on each training split
    pipe = Pipeline([('pca', PCA()),
                     ('clf', LinearSVC())])

    # Note: LinearSVC has no gamma parameter (it uses a linear kernel),
    # so only C and the number of PCA components are tuned here
    params = {
        'pca__n_components': [2, 5, 10, 15],
        'clf__C': [0.5, 1, 5, 10],
    }

    # cv=2 gives the internal 2-fold grid search asked for in the question
    gs = GridSearchCV(estimator=pipe, param_grid=params, cv=2)
    gs.fit(X_train, y_train)
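
  • If you also want the outer 10-fold cross-validation from the question, one option (a minimal sketch, assuming X and y hold the full feature matrix and labels) is to wrap the GridSearchCV object in cross_val_score. This gives a nested cross-validation: the outer 10 folds estimate performance, the inner 2 folds pick the hyperparameters.

    from sklearn.model_selection import cross_val_score

    # Outer 10-fold CV: every outer split refits the whole pipeline,
    # so PCA and the grid search only ever see that split's training data
    scores = cross_val_score(gs, X, y, cv=10)
    print(scores.mean(), scores.std())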