Tags: scikit-learn, pca, cross-validation, feature-selection

How to cross-validate PCA in an sklearn pipeline without overfitting?


My input is time series data. I want to decompose the dataset with PCA (I don't want to run PCA on the entire dataset first, because that would be overfitting) and then apply feature selection to the resulting components (using a KNN regressor as the estimator).

This is my code so far:

tscv = TimeSeriesSplit(n_splits=10)
pca = PCA(n_components=.5,svd_solver='full').fit_transform()
knn = KNeighborsRegressor(n_jobs=-1)
sfs = SequentialFeatureSelector(estimator=knn,n_features_to_select='auto',tol=.001,scoring=custom_scorer,n_jobs=-1)
pipe = Pipeline(steps=[("pca", pca), ("sfs", sfs), ("knn", knn)])
cv_score = cross_val_score(estimator=pipe,X=X,y=y,scoring=custom_scorer,cv=tscv,verbose=10)
print(np.average(cv_score),' +/- ',np.std(cv_score))
print(X.columns)

The problem is that I want to make sure PCA isn't looking at the entire dataset when it computes the variance of the features. I also want it to be fit and transformed, but that doesn't work. Depending on whether I write .fit_transform or .fit_transform(), I get one of the following errors:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<bound method PCA.fit_transform of PCA(svd_solver='full')>' (type <class 'method'>) doesn't

or

TypeError: fit_transform() missing 1 required positional argument: 'X'

Solution

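  • Do not use pca = PCA(...).fit_transform or pca = PCA(...).fit_transform() when defining your pipeline. The first passes a bound method instead of a transformer, and the second calls fit_transform without any data; these are the two errors you see.

    Instead, use pca = PCA(...), so that the step is the unfitted estimator object. The pipeline calls fit_transform on it automatically during model fitting. Because cross_val_score clones the pipeline and refits it on the training portion of each split, PCA is fitted only on the training data of that split and never sees the corresponding test data.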