
Scaling in scikit-learn permutation_test_score


I'm using the scikit-learn permutation_test_score function to evaluate the significance of my estimator's performance. Unfortunately, I cannot tell from the scikit-learn documentation whether the method applies any scaling to the data. I usually standardise my data with a StandardScaler, fitting it on the training set and applying the same transformation to the test set.
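
Roughly, that workflow looks like this (a minimal sketch; the dataset here is just a placeholder):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set only
    X_test_scaled = scaler.transform(X_test)        # apply the same mean/std to the test set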


Solution

  • The function itself does not apply any scaling.

    Here is an example from the documentation:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold, permutation_test_score
    from sklearn import datasets
    
    
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    
    # Some noisy, uncorrelated features
    random = np.random.RandomState(seed=0)
    E = random.normal(size=(len(X), 2200))
    
    # Add the noisy features to the informative ones to make the task harder
    X = np.c_[X, E]
    
    svm = SVC(kernel='linear')
    cv = StratifiedKFold(2)
    
    score, permutation_scores, pvalue = permutation_test_score(
        svm, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
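
    The returned pvalue is derived from how many permutation scores reach the true score: per the scikit-learn docs it is (C + 1) / (n_permutations + 1), where C counts permutations that score at least as well as the original labels. As a quick sanity check, it can be recomputed by hand from the values above:

    # Recompute the p-value reported by permutation_test_score
    manual_pvalue = (np.sum(permutation_scores >= score) + 1) / (100 + 1)
    print(pvalue, manual_pvalue)  # the two should match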
    

    However, what you may want to do is pass a Pipeline that applies the scaling to permutation_test_score. Within each cross-validation split the scaler is then fit on the training folds only and applied to the held-out fold, which matches the train/test discipline you describe.

    Example:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC(kernel='linear'))])
    score, permutation_scores, pvalue = permutation_test_score(
        pipe, X, y, scoring="accuracy", cv=cv, n_permutations=100, n_jobs=1)
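
    Either way, the returned values can be inspected directly, for example:

    print(f"True accuracy: {score:.3f}")
    print(f"Mean permuted accuracy: {permutation_scores.mean():.3f} over {len(permutation_scores)} permutations")
    print(f"p-value: {pvalue:.4f}")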