python, scikit-learn, pipeline

Scikit Learn Pipeline: Calling .fit() and .score() vs cross_val_score()


Imagine we have the following pipeline:

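from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

example_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=len(X.columns)-5)),
    ('classifier', KNeighborsClassifier())
])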

Now we want to get the performance of the pipeline with:

# 1)
cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy').mean()

# 2)
example_pipe.fit(X_train, y_train)
example_pipe.score(X_test, y_test)

How does the first differ from the second with regard to the score we get (apart from the fact that it does cross-validation)? Do we have to call example_pipe.fit() before using cross_val_score()?

I've found the following methods in the documentation, but it's a bit confusing because I thought that calling .fit() already implies calling .transform().

fit(X[, y]) --> Fit the model

fit_predict(X[, y]) --> Applies fit_predict of last step in pipeline after transforms.

fit_transform(X[, y]) --> Fit the model and transform with the final estimator

score(X[, y, sample_weight]) --> Apply transforms, and score with the final estimator


Solution

  • Do we have to call example_pipe.fit() before using cross_val_score()?

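    No. The scikit-learn documentation for cross_val_score answers this: the function fits the estimator itself. For each of the cv splits it fits a clone of example_pipe on the training portion and scores that clone on the held-out portion. You therefore do not need to call example_pipe.fit() first, and example_pipe itself is left unfitted afterwards.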
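
The two approaches from the question can be compared side by side in a minimal sketch. The dataset (load_breast_cancer), the split settings in train_test_split, and the helper variable names are assumptions made for illustration, since the question does not say what X and y are. Because X is a NumPy array here, X.shape[1] stands in for len(X.columns).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed stand-in data; the question does not specify X and y.
X, y = load_breast_cancer(return_X_y=True)

example_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=X.shape[1] - 5)),
    ('classifier', KNeighborsClassifier()),
])

# 1) No prior fit needed: each of the 5 splits fits its own clone.
cv_scores = cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy')
print(cv_scores, cv_scores.mean())

# The original pipeline object was not fitted by cross_val_score,
# so its scaler has no learned mean_ attribute yet.
scaler_fitted_after_cv = hasattr(example_pipe.named_steps['scaler'], 'mean_')

# 2) A single hold-out split: one fit, one score on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
example_pipe.fit(X_train, y_train)
holdout_score = example_pipe.score(X_test, y_test)

# What score() does internally: transform X_test with the steps that
# were fitted on X_train, then score with the final classifier.
Xt = example_pipe.named_steps['scaler'].transform(X_test)
Xt = example_pipe.named_steps['selector'].transform(Xt)
manual_score = example_pipe.named_steps['classifier'].score(Xt, y_test)
print(holdout_score, manual_score)
```

The first approach returns one accuracy per split, each from a pipeline fitted only on that split's training portion, so the mean is an average over five train/test partitions of the whole dataset. The second returns a single accuracy that depends on one particular split. The last three lines also address the fit/transform confusion: fit() learns the parameters of every step, while score() only applies the already-fitted transforms before scoring.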