Imagine we have the following pipeline:
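from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.neighbors import KNeighborsClassifier

example_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=len(X.columns)-5)),
    ('classifier', KNeighborsClassifier())
])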
Now we want to get the performance of the pipeline with:
# 1)
cross_val_score(example_pipe, X, y, cv=5, scoring='accuracy').mean()
# 2)
example_pipe.fit(X_train, y_train)
example_pipe.score(X_test, y_test)
How does the first differ from the second with regard to the score we get (apart from the fact that it does cross-validation)? Do we have to call example_pipe.fit() before using cross_val_score()?
I've found the following methods in the documentation, but it's a bit confusing because I thought that calling .fit() already implies calling .transform().
fit(X[, y]) --> Fit the model
fit_predict(X[, y]) --> Applies fit_predict of last step in pipeline after transforms.
fit_transform(X[, y]) --> Fit the model and transform with the final estimator
score(X[, y, sample_weight]) --> Apply transforms, and score with the final estimator
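To make those method descriptions concrete, here is a minimal sketch of what fit() and score() on a pipeline amount to when written out step by step. The data is a synthetic stand-in generated with make_classification (an assumption, not the X and y from the question), and k is set to 7 to mirror "number of columns minus 5" for 12 features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 12 features.
X, y = make_classification(n_samples=200, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(k=7)),
    ("classifier", KNeighborsClassifier()),
])

# Pipeline.fit: fit_transform on every intermediate step, fit on the last one.
pipe.fit(X_train, y_train)
# Pipeline.score: transform with every intermediate step, score with the last one.
pipe_score = pipe.score(X_test, y_test)

# The same thing done by hand with separate, freshly created objects.
scaler = StandardScaler()
selector = SelectKBest(k=7)
clf = KNeighborsClassifier()

X_train_t = selector.fit_transform(scaler.fit_transform(X_train), y_train)
clf.fit(X_train_t, y_train)

X_test_t = selector.transform(scaler.transform(X_test))  # transform only, no refit
manual_score = clf.score(X_test_t, y_test)

print(pipe_score, manual_score)
```

So .fit() on the pipeline does transform the training data on its way to the classifier, and .score() only applies the already fitted transforms to the test data.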
Do we have to call example_pipe.fit() before using cross_val_score()?
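No. According to the scikit-learn documentation, cross_val_score does the fitting for you: for each of the cv folds it clones example_pipe, fits the clone on the training part of that fold, and scores it on the held-out part. The example_pipe object you pass in is not modified, so calling fit() on it beforehand has no effect on the cross-validation scores.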
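As a rough illustration, the sketch below compares cross_val_score on an unfitted pipeline with a hand-written loop that clones, fits, and scores per fold. It assumes synthetic data from make_classification in place of the question's X and y, and it relies on cv=5 meaning an unshuffled StratifiedKFold for a classifier. It is a simplification of what the library does, not its actual implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 12 features.
X, y = make_classification(n_samples=200, n_features=12, random_state=0)

example_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(k=X.shape[1] - 5)),
    ("classifier", KNeighborsClassifier()),
])

# 1) No fit() beforehand: the pipeline is passed in unfitted.
cv_scores = cross_val_score(example_pipe, X, y, cv=5, scoring="accuracy")

# Roughly the same procedure written out by hand.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_pipe = clone(example_pipe)            # fresh, unfitted copy
    fold_pipe.fit(X[train_idx], y[train_idx])  # scaler and selector see only the training fold
    manual_scores.append(fold_pipe.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print(cv_scores.mean(), manual_scores.mean())

# The original object was never fitted; only the clones were.
print(hasattr(example_pipe.named_steps["classifier"], "classes_"))
```

The difference from approach 2) is therefore only in how the data is split: approach 2) gives one score from a single train/test split, while approach 1) averages five such scores, each from a pipeline refitted on a different training fold.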