Tags: machine-learning, scikit-learn, svm, normalization, pca

StandardScaler with make_pipeline


If I use make_pipeline, do I still need to call fit and transform myself to fit my model and transform the data, or will the pipeline do this itself?
Also, does StandardScaler perform normalization as well, or only scaling?
To explain the code: I want to apply PCA and then apply normalization followed by an SVM.

from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

pca = PCA(n_components=4).fit(X)
X = pca.transform(X)

# training a linear SVM classifier with 5-fold cross-validation
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

clf = make_pipeline(preprocessing.StandardScaler(), SVC(kernel='linear'))
scores = cross_val_score(clf, X, y, cv=5)

I am also a bit confused about what happens if I don't use the fit function in the code below:

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

clf = SVC(kernel = 'linear', C = 1)
scores = cross_val_score(clf, X, y, cv=5)


Solution

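  • StandardScaler standardizes each feature: it subtracts the feature's mean and divides by its standard deviation, so every column ends up with zero mean and unit variance. It does not rescale to a fixed range (that is MinMaxScaler) or to unit norm per sample (that is Normalizer).

    cross_val_score() fits the estimator on the training part of each fold and scores it on the held-out part, so you don't need to call fit() explicitly. With a pipeline, the scaler is fitted on the training part only and then applied to the held-out part. The same holds for the plain SVC in your second snippet.

    A more common approach is to put all steps (StandardScaler, PCA, SVC) in one pipeline and use GridSearchCV to tune the hyperparameters and choose the best combination.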

    Demo:

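    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
    pipe = Pipeline([
            ('scale', StandardScaler()),
            ('reduce_dims', PCA(n_components=4)),
            ('clf', SVC(kernel = 'linear', C = 1))
    ])
    
    # step name + double underscore + parameter name addresses a step's parameter
    param_grid = dict(reduce_dims__n_components=[4,6,8],
                      clf__C=np.logspace(-4, 1, 6),
                      clf__kernel=['rbf','linear'])
    
    grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
    grid.fit(X_train, y_train)
    print(grid.score(X_test, y_test))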
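The first two points of this answer can be checked with a small sketch. The data here is synthetic (make_classification stands in for the asker's X and y, which are not shown), and names such as col_means and still_unfitted are only illustrative. It shows what StandardScaler does to each column, and that cross_val_score runs without any explicit fit or transform call because it works on copies of the pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; the question's X and y are not shown).
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=0)

# StandardScaler standardizes every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))

# No explicit fit/transform here: cross_val_score fits a clone of the
# pipeline on each training split and scores it on the held-out split.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# Because only clones were fitted, clf itself is still unfitted afterwards.
try:
    clf.predict(X)
    still_unfitted = False
except NotFittedError:
    still_unfitted = True
print(still_unfitted)
```

If a fitted model is needed afterwards (for example to predict on new data), call clf.fit(X, y) yourself once cross-validation has finished. cross_val_score only returns the scores.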