python machine-learning scikit-learn anomaly-detection

Implementing feature selection

One problem I ran into when predicting with a feature-selected data set: once features have been selected on the training set, the test set no longer lines up with it, because the training set now has fewer features than the test set. How do you implement feature selection properly, so that the test set ends up with the same features as the training set?

Example:

 >>> from sklearn.datasets import load_iris
 >>> from sklearn.feature_selection import SelectKBest
 >>> from sklearn.feature_selection import chi2
 >>> iris = load_iris()
 >>> X, y = iris.data, iris.target
 >>> X.shape
 (150, 4)
 >>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
 >>> X_new.shape
 (150, 2)

Solution

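You have to transform your test set too, and for that you call transform rather than fit_transform, so that the features chosen on the training data are the same ones kept in the test data. This requires you to keep your fitted SelectKBest object around, so something to the effect of: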

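from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(chi2, k=2)
# Fit on the training data only, then reuse the fitted selector on the test data
X_train_clean = selector.fit_transform(X_train, y_train)
X_test_clean = selector.transform(X_test)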
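
If you would rather not carry the selector around by hand, one option is to wrap it in a scikit-learn Pipeline together with the model. This is only a rough sketch, not part of the original answer: the iris split, the LogisticRegression classifier, and the step names "select" and "clf" are all assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Illustrative data and split (assumed; the question does not show a split)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The classifier is an arbitrary choice; any estimator could go here
pipe = Pipeline([
    ("select", SelectKBest(chi2, k=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() fits the selector on the training data only, then trains the model
pipe.fit(X_train, y_train)

# predict() applies the already-fitted selector to X_test before predicting
predictions = pipe.predict(X_test)

# Boolean mask over the original columns showing which features were kept
selected_mask = pipe.named_steps["select"].get_support()
```

With this setup the test set goes through the same column selection as the training set automatically, and selected_mask tells you which of the original columns survived.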