tensorflow keras scikit-learn neural-network cross-validation

Different results when using Manual KFold-Cross validation vs. KerasClassifier-KFold Cross Validation

I've been struggling to understand why two similar Kfold-cross validations result in two different averages.

When I use a manual KFold approach (with Tensorflow and Keras)

cvscores = []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)
for train, test in kfold.split(X, y):
  model = create_baseline()
  model.fit(X[train], y[train], epochs=50, batch_size=32, verbose=0)
  scores = model.evaluate(X[test], y[test], verbose=0)
  #print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
  cvscores.append(scores[1] * 100)

print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

I get

65.89% (+/- 3.77%)

When I use the KerasClassifier wrapper from scikit

estimator = KerasClassifier(build_fn=create_baseline, epochs=50, batch_size=32, verbose=0)
kfold = StratifiedKFold(n_splits=10,shuffle=True, random_state=3)
results = cross_val_score(estimator, X, y, cv=kfold, scoring='accuracy')
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

I get

63.82% (5.37%)

Additionally, when using KerasClassifier the following warning appears

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/wrappers/scikit_learn.py:241: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).

Do the results differ because KerasClassifier uses predict_classes() while the manual Tensorflow/Keras approach uses just predict()? If so, which approach is more reasonable?

My model looks like this

def create_baseline():
model = tf.keras.models.Sequential()
model.add(Dense(8, activation='relu', input_shape=(12,)))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model

Solution

The two CV-results do not look too different, they are both within each others standard deviation.

You fixed the seed for the StratifiedKFold class, that's good. However there is additional randomness you should take control of and that comes from the weight initialization. Make sure you initialize your model for each CV-run with different weights, but use the same 10 initializations for both cross-validations, manual and automatic. You can pass an initializer to each layer, they have a seed argument as well. In general you should fix all possible seeds (np.random.seed(3), tf.set_random_seed(3)).

What happens if you run cross_val_score() or your manual version twice? Do you get the same results / numbers?