Tags: python, machine-learning, keras, scikit-learn, cross-validation

Why is my cross_val_score() accuracy very high, but my test accuracy very low?


Using the Keras scikit-learn wrapper, I get a very high cross-validation accuracy on my training data: above 95%.

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier

# hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(train_data, train_labels, shuffle=True, test_size=0.3, random_state=42)

# 3-fold cross-validation on the training split
estimator = KerasClassifier(build_fn=build_model(130, 130, 20000), epochs=2, batch_size=128, verbose=1)
folds = KFold(n_splits=3, shuffle=True, random_state=128)
results = cross_val_score(estimator=estimator, X=X_train, y=y_train, cv=folds)

However, my prediction accuracy is not great at all. Is it a classic case of overfitting?

from sklearn.model_selection import cross_val_predict
from sklearn import metrics

prediction = cross_val_predict(estimator=estimator, X=X_test, y=y_test, cv=folds)

metrics.accuracy_score(y_test_converted, prediction)
# accuracy is 0.03%

How can I improve my testing accuracy? Thanks


Solution

  • Is it a classic case of overfitting?

    It is not - it's just that your process is wrong.

    cross_val_predict is not meant to be applied to the test data the way you use it here. It clones your estimator and retrains it from scratch within each fold of the test set, which is far smaller than your training set, so the predictions come from models trained on only a fraction of an already small dataset. That, rather than overfitting, is probably why the accuracy collapses.

    The correct procedure is: fit your estimator on the training data, get predictions on the test set, and then calculate the test accuracy, i.e. (a fuller end-to-end sketch follows after this snippet):

    estimator.fit(X_train, y_train)          # train once on the full training set
    y_pred = estimator.predict(X_test)       # predict on the held-out test set
    metrics.accuracy_score(y_test, y_pred)   # test accuracy
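
    For reference, here is a minimal sketch of the whole workflow, reusing the train_data, train_labels and build_model from your question and assuming build_model(130, 130, 20000) returns a function that builds a compiled Keras model: cross-validation runs on the training split only (to estimate generalization), and the test split is touched exactly once at the end.

    from sklearn.model_selection import train_test_split, KFold, cross_val_score
    from sklearn import metrics
    from keras.wrappers.scikit_learn import KerasClassifier

    # hold out a test set that cross-validation never sees
    X_train, X_test, y_train, y_test = train_test_split(
        train_data, train_labels, shuffle=True, test_size=0.3, random_state=42)

    estimator = KerasClassifier(build_fn=build_model(130, 130, 20000),
                                epochs=2, batch_size=128, verbose=1)

    # cross-validation on the TRAINING data only: an estimate of generalization
    folds = KFold(n_splits=3, shuffle=True, random_state=128)
    cv_scores = cross_val_score(estimator=estimator, X=X_train, y=y_train, cv=folds)
    print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

    # final evaluation: fit once on all training data, score once on the test set
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    # if y_test is one-hot encoded, convert it back to class indices first
    # (as you did with y_test_converted)
    print("Test accuracy:", metrics.accuracy_score(y_test, y_pred))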