Tags: keras, neural-network, lstm, evaluate

Different results when evaluating model performance on test data using model.evaluate and model.predict


I have a question regarding the model.evaluate() and model.predict() functions in Keras. I built a simple LSTM model in Keras and want to test its performance on the test dataset. I considered the following two ways to compute the metric on the test dataset:

  • Use model.evaluate() method
  • Use model.predict() method to obtain the fitted values and compute the metric manually

However, I ended up with different results. In addition, the result of model.evaluate() also depends on the value of the batch_size argument. Based on my understanding and this post, they should produce the same result. Here is code that reproduces the issue:

from keras.models import Model
from keras.layers import Dense, LSTM, Input
import numpy as np
import keras.backend as K
from keras.callbacks import EarlyStopping

class VLSTM:
    def __init__(self, input_shape=(6, 1), nb_output_units=1, nb_hidden_units=128, dropout=0.0, 
                 recurrent_dropout=0.0, nb_layers=1):
        self.input_shape = input_shape
        self.nb_output_units = nb_output_units
        self.nb_hidden_units = nb_hidden_units
        self.nb_layers = nb_layers
        self.dropout = dropout
        self.recurrent_dropout = recurrent_dropout

    def build(self):
        inputs = Input(shape=self.input_shape)
        outputs = LSTM(self.nb_hidden_units)(inputs)
        outputs = Dense(1, activation=None)(outputs)
        return Model(inputs=[inputs], outputs=[outputs])
    
def RMSE(output, target):
    # root-mean-squared error, used both as the loss and as a metric
    return K.sqrt(K.mean((output - target) ** 2))

n_train = 500
n_val = 100
n_test = 250 

X_train = np.random.rand(n_train, 6, 1)
Y_train = np.random.rand(n_train, 1)
X_val = np.random.rand(n_val, 6, 1)
Y_val = np.random.rand(n_val, 1)
X_test = np.random.rand(n_test, 6, 1)
Y_test = np.random.rand(n_test, 1)

input_shape = (X_train.shape[1], X_train.shape[2])
model = VLSTM(input_shape=input_shape)
m = model.build()
m.compile(loss=RMSE,
          optimizer='adam',
          metrics=[RMSE])

callbacks = []
callbacks.append(EarlyStopping(patience=30))


# train model
hist = m.fit(X_train, Y_train,
             batch_size=32, epochs=10, shuffle=True,
             validation_data=(X_val, Y_val), callbacks=callbacks)

# Use evaluate method with default batch size
test_rmse = m.evaluate(X_test, Y_test)[1]
print("RMSE is {} using evaluate method with default batch size".format(test_rmse))

# Use evaluate method with batch size 1
test_rmse = m.evaluate(X_test, Y_test, batch_size=1)[1]
print("RMSE is {} using evaluate method with batch size = 1".format(test_rmse))

# Use evaluate method with batch size = n_test
test_rmse = m.evaluate(X_test, Y_test, batch_size=n_test)[1]
print("RMSE is {} using evaluate method with batch size = n_test".format(test_rmse))

# Use predict method and compute RMSE manually
Y_test_pred = m.predict(X_test)
test_rmse = np.sqrt(((Y_test_pred - Y_test) ** 2).mean())
print("RMSE is {} using predict method".format(test_rmse))

After running the code, here are the results:

RMSE is 0.3068242073059082 using evaluate method with default batch size

RMSE is 0.26647186279296875 using evaluate method with batch size = 1

RMSE is 0.30763307213783264 using evaluate method with batch size = n_test

RMSE is 0.3076330596820157 using predict method

It looks like model.predict() and model.evaluate() with batch_size = n_test give the same result. Can anyone explain this? Thanks in advance!


Solution

  • Yes, your guess is correct: the RMSE calculated from predict is indeed equal to evaluate with batch_size=len(dataset).
    This is easy to understand: when you compute the RMSE from predict, you have not divided your dataset into batches; you compute it over all samples at once. evaluate, by contrast, computes the metric on each batch and then averages the per-batch values. Because RMSE takes a square root inside each batch, the mean of per-batch RMSEs is generally not equal to the RMSE over the whole dataset, which is why the result depends on batch_size.
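
    Here is a minimal sketch of that effect with hypothetical error values (not taken from the model above): the RMSE over all samples differs from the mean of per-batch RMSEs whenever the batches have different errors.

    errs = np.array([0.1, 0.9, 0.2, 0.8])            # hypothetical prediction errors
    whole = np.sqrt(np.mean(errs ** 2))              # RMSE over the full "dataset"
    per_batch = [np.sqrt(np.mean(errs[:2] ** 2)),    # RMSE of batch 1
                 np.sqrt(np.mean(errs[2:] ** 2))]    # RMSE of batch 2
    print(whole, np.mean(per_batch))                 # ~0.6124 vs ~0.6117 -- not equal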

    Of course, you can also compute the RMSE from predict in batches, mimicking what evaluate does:

    Y_test_pred_batches = np.split(Y_test_pred, 5, axis=0)  # 5 batches -> batch_size = 250 / 5 = 50
    Y_test_batches = np.split(Y_test, 5, axis=0)
    batch_rmses = []
    for y_pred, y_true in zip(Y_test_pred_batches, Y_test_batches):
        # same manual RMSE as above, computed per batch
        batch_rmses.append(np.sqrt(((y_pred - y_true) ** 2).mean()))
    np.mean(batch_rmses)
    

    The output of this is: 0.28436336682976376

    Now with evaluate:

    test_mse = m.evaluate(X_test, Y_test, batch_size=50)[1]
    test_mse
    

    The output of this is: 0.28436335921287537

    So, up to floating-point precision, they are the same.

    If you try np.split(Y_test_pred, 250, axis=0), which makes the batch size 1, the output in my case is 0.24441334738835332. With evaluate and batch_size=1 the output is 0.244413360953331. So you can see it is the same.
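
    As a side note: if you want evaluate to report an RMSE that is independent of batch_size, one option (a sketch, assuming a tf.keras / TensorFlow 2.x setup) is the built-in stateful metric tf.keras.metrics.RootMeanSquaredError, which accumulates squared errors across all batches and only takes the root at the end:

    import tensorflow as tf

    # Train on plain MSE (sqrt is monotone, so the minimizer is the same) and let the
    # stateful metric aggregate squared errors over the whole dataset before the sqrt,
    # so the reported RMSE no longer depends on batch_size.
    m.compile(loss='mse',
              optimizer='adam',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])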