machine-learning, scikit-learn, pipeline, feature-selection

How to train model with features selected by SelectKBest?


I am using SelectKBest() in sklearn's Pipeline() class to reduce the number of features from 30 to the 5 best. When I fit the classifier, I get different test results, as expected with feature selection. However, I spotted a mistake in my code that doesn't seem to cause an actual error at runtime.

When I call predict(), I realised that it was still being given all 30 features as input, as if feature selection wasn't occurring, even though I only trained the model on the 5 best features. Surely giving 30 features to an SVM to predict a class should crash if it was only trained on the 5 best features?

In my train_model(df) function, my code looks as follows:

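from sklearn import preprocessing, svm
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

def train_model(df):
    # balance_dataset() is defined elsewhere in my code
    x, y = balance_dataset(df)
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

    feature_selection = SelectKBest()

    # Scale -> select the k best features -> fit the SVM on the selected features
    pipe = Pipeline([('sc', preprocessing.MinMaxScaler()),
                     ('feature_selection', feature_selection),
                     ('SVM', svm.SVC(decision_function_shape='ovr', kernel='poly'))])

    # k is set here through the grid, so SelectKBest keeps 5 features
    candidate_parameters = [{'SVM__C': [0.01, 0.1, 1], 'SVM__gamma': [0.01, 0.1, 1], 'feature_selection__k': [5]}]

    clf = GridSearchCV(estimator=pipe, param_grid=candidate_parameters, cv=5, n_jobs=-1)
    clf.fit(X_train, y_train)

    return clf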

However, this is what happens when I call trade():

def trade(df):
    clf = train_model(df) 

    for index, row in trading_set.iterrows(): 

        features = row[:-3] #features is now an array of 30 features, even though model is only trained on 5

        if trade_balance > 0:
            trades[index] = trade_balance
            if clf.predict(features) == 1: #So this should crash and give an input Shape error, but it doesn't
                #Rest of code unnecessary#

So my question is, how do I know that the model is really being trained on only the 5 best features?


Solution

  • Your code is correct, and there is no reason for it to raise an error. You are confusing the pipeline object with the model itself, which is only one step of the pipeline.

    In your example, the pipeline takes 30 features, scales them, selects the 5 best, then trains an SVM on those 5. So your SVM has been trained on the 5 best features, but you still need to pass all 30 features to the pipeline, because the pipeline expects data in the same format as during training. At prediction time it applies the fitted scaler and selector to the input before handing the 5 selected columns to the SVM.
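
    To check this yourself, you can inspect the fitted steps of the best pipeline found by the grid search. The snippet below is a minimal sketch, not your exact setup: it uses synthetic data from make_classification as a stand-in for your 30-feature dataset, a smaller parameter grid, and it assumes scikit-learn 0.24 or later for the n_features_in_ attribute.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 200 samples, 30 features (an assumption).
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("sc", MinMaxScaler()),
    ("feature_selection", SelectKBest(k=5)),
    ("SVM", SVC(kernel="poly")),
])

# Small illustrative grid; refit=True (the default) refits the best pipeline.
clf = GridSearchCV(pipe, param_grid={"SVM__C": [0.1, 1]}, cv=3)
clf.fit(X, y)

best_pipe = clf.best_estimator_
selector = best_pipe.named_steps["feature_selection"]
svm_step = best_pipe.named_steps["SVM"]

# Column indices (into the original 30 features) kept by SelectKBest.
selected = selector.get_support(indices=True)

# Number of features each step saw during fit.
n_seen_by_pipeline = best_pipe.named_steps["sc"].n_features_in_
n_seen_by_svm = svm_step.n_features_in_

print("selected columns:", selected)
print("features into the pipeline:", n_seen_by_pipeline)
print("features into the SVM:", n_seen_by_svm)

# predict() takes all 30 columns; the pipeline reduces them internally.
prediction = clf.predict(X[:1])
```

    If the SVM step reports 5 features while the first step reports 30, the selection is happening inside the pipeline, which is why predict() accepts (and requires) the full 30-feature input.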