I'm trying to use the LinearDiscriminantAnalysis (LDA) class from sklearn as the preprocessing part of my modeling, to reduce the dimensionality of my data, and then to apply a KNN classifier. I know that good practice is to use a Pipeline to tie the preprocessing and modeling steps together, and I use cross_validate to guard against overfitting via cross-validation. But when I build my pipeline and pass it to cross_validate, it seems that only the LDA is used to classify my data (since LDA can act as a classifier on its own). I don't understand why; it is as if, because the LDA can already predict the class, the pipeline just uses that and ignores the KNN. I may be using the Pipeline class wrong.
Below you can find the code with the pipeline (LDA + KNN) and a version with just LDA; the results are exactly the same. Note that when I transform (reduce) the data beforehand and pass the reduced data to cross_validate with the KNN alone, my results are way better.
# Imports and estimator setup (added here so the snippet runs standalone)
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

lda = LinearDiscriminantAnalysis()
knn = KNeighborsClassifier()

# Define the pipeline to use LDA as the preprocessing step
pipeline2 = Pipeline([
    ('lda', lda),
    ('knn', knn)
])

# Use stratified cross-validation on the pipeline (LDA + KNN) classifier
result_test = pd.DataFrame(cross_validate(
    pipeline2,
    X_train_reduced,
    y_train,
    return_train_score=True,
    cv=3,
    scoring=['accuracy']
))

# Get mean train and validation accuracy
print(f"Mean train accuracy: {result_test['train_accuracy'].mean():.3f}")
print(f"Mean validation accuracy: {result_test['test_accuracy'].mean():.3f}")
Mean train accuracy: 1.000
Mean validation accuracy: 0.429
# Define the pipeline with LDA only
pipeline2 = Pipeline([
    ('lda', lda),
    # ('knn', knn)  # the KNN step is commented out in this version!
])

# Use stratified cross-validation on the LDA-only pipeline
result_test = pd.DataFrame(cross_validate(
    pipeline2,
    X_train_reduced,
    y_train,
    return_train_score=True,
    cv=3,
    scoring=['accuracy']
))

# Get mean train and validation accuracy
print(f"Mean train accuracy: {result_test['train_accuracy'].mean():.3f}")
print(f"Mean validation accuracy: {result_test['test_accuracy'].mean():.3f}")
Mean train accuracy: 1.000
Mean validation accuracy: 0.429
Note that the data used is quite complex: it comes from MRI images and has already been reduced with PCA to filter out image noise.
Thank you for your help!
I think this is reasonable behavior, though not guaranteed to happen. LDA.transform is reducing to the top two (= n_classes - 1) dimensions of its internal model, and the 5-NN model then ends up predicting nearly the same way as the full LDA.predict (I'd guess because the next most important dimensions don't add much). If you pressed it, you might find that the KNN has wavier decision boundaries than the nice linear ones from the LDA, but since the LDA can already perfectly predict the training set, that doesn't make much difference.
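You can check this directly. Here's a minimal sketch (assuming the X_train_reduced and y_train from your question, plus default-parameter estimators) that measures how often LDA alone agrees with the LDA-then-KNN pipeline on held-out data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Hold out part of the training data to compare predictions on
X_tr, X_te, y_tr, y_te = train_test_split(
    X_train_reduced, y_train, stratify=y_train, random_state=0)

lda_only = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
lda_knn = Pipeline([('lda', LinearDiscriminantAnalysis()),
                    ('knn', KNeighborsClassifier())]).fit(X_tr, y_tr)

# Fraction of held-out points where the two models predict the same label
agree = np.mean(lda_only.predict(X_te) == lda_knn.predict(X_te))
print(f"Prediction agreement on held-out data: {agree:.3f}")

If the agreement is near 1.0, the KNN is effectively just re-reading the LDA's projection.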
That said, a test accuracy of 0.43 is quite a lot lower. I suppose that could be because the top two dimensions in LDA, while really good for separating the training set, aren't very good on the test set (for at least some of the fold splits). I'd be curious to know how different the top two dimensions actually are across folds; the sketch below is one way to look.
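With the default svd solver, the fitted scalings_ attribute holds the discriminant directions, so you could compare each fold's directions against those fitted on all of the training data. A rough sketch (it assumes the columns of scalings_ line up in the same order across fits, which may not hold exactly):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.asarray(X_train_reduced)  # data from the question
y = np.asarray(y_train)

# Reference directions fitted on all of the training data
ref = LinearDiscriminantAnalysis().fit(X, y).scalings_[:, :2]

for i, (tr_idx, _) in enumerate(StratifiedKFold(n_splits=3).split(X, y)):
    fold = LinearDiscriminantAnalysis().fit(X[tr_idx], y[tr_idx]).scalings_[:, :2]
    for j in range(2):
        # The sign of a discriminant direction is arbitrary, so compare |cos|
        cos = abs(ref[:, j] @ fold[:, j]) / (
            np.linalg.norm(ref[:, j]) * np.linalg.norm(fold[:, j]))
        print(f"fold {i}, direction {j}: |cosine similarity| = {cos:.3f}")

Values well below 1.0 would mean the "top two dimensions" are unstable across folds.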
"Note that when I transform (reduce) the data beforehand and pass the reduced data to cross_validate with the KNN alone, my results are way better."
That's due to data leakage: the LDA got to see the entire training set before cross-validation started, leaking information about the test folds to each KNN. Related to the previous paragraph, the top two dimensions selected that way are good for all of the fold splits.
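To make the contrast concrete, here's a sketch of the two setups (variable names taken from the question). The first mirrors "transform first, then cross-validate" and leaks; the second re-fits the LDA on each fold's training portion only:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

# Leaky: LDA is fit on ALL rows, so rows that later land in a test fold
# already influenced the projection the KNN sees
X_leaky = LinearDiscriminantAnalysis().fit_transform(X_train_reduced, y_train)
leaky = cross_validate(KNeighborsClassifier(), X_leaky, y_train, cv=3)

# Honest: the pipeline re-fits the LDA inside each CV fold
pipe = Pipeline([('lda', LinearDiscriminantAnalysis()),
                 ('knn', KNeighborsClassifier())])
honest = cross_validate(pipe, X_train_reduced, y_train, cv=3)

print(f"leaky CV accuracy:  {leaky['test_score'].mean():.3f}")
print(f"honest CV accuracy: {honest['test_score'].mean():.3f}")

The pipeline version is the trustworthy estimate, since each test fold is projected by an LDA that never saw it.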