
Meaning of batch_size in model.evaluate()


I am building a plain vanilla FNN and want to evaluate my model after training. I was wondering what impact the batch_size has when evaluating the model on a test set. Of course it is relevant for training, as it determines the number of samples to be fed to the network before computing the next gradient. It is also clear that it can be needed when predicting values for a (stateful) RNN. But it is not clear to me why it is needed when evaluating the model, especially an FNN. Furthermore, I get slightly different values when I evaluate the model on the same test set but with different batch sizes. Consider the following toy example:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

# function to be learned
def f(x):
    return x[0] + x[1] + x[2]

# sample training and test points on a rectangular grid
x_train = np.random.uniform(low = -10, high = 10, size = (50,3))
y_train = np.apply_along_axis(f, 1, x_train).reshape(-1,1)

x_test = np.random.uniform(low = -10, high = 10, size = (50,3))
y_test = np.apply_along_axis(f, 1, x_test).reshape(-1,1)

model = Sequential()
model.add(Dense(20, input_dim = 3, activation = 'tanh'))
model.add(Dense(1))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mse',
      optimizer=sgd)
model.fit(x_train, y_train, batch_size = 10, epochs = 30, verbose = 0)

model.evaluate(x_test, y_test, batch_size = 10)
model.evaluate(x_test, y_test, batch_size = 20)
model.evaluate(x_test, y_test, batch_size = 30)
model.evaluate(x_test, y_test, batch_size = 40)
model.evaluate(x_test, y_test, batch_size = 50)

The values are very similar but nevertheless different. Where does this come from? Shouldn't the following be always true?

from sklearn.metrics import mean_squared_error as mse
0 == model.evaluate(x_test, y_test) - mse(y_test, model.predict(x_test))

Solution

  • No, they don't have to be the same. If you combine floating point math with parallelism, you don't get reproducible results, because (a + b) + c is not necessarily the same as a + (b + c) when a, b, and c are floating point numbers.

    The evaluate function of Model takes a batch size just to speed up evaluation: the network can process multiple samples at a time, and with a GPU this makes evaluation much faster. I think the only way to reduce the effect of this would be to set batch_size to one.
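
The non-associativity mentioned above is easy to demonstrate in plain Python/NumPy. The following is an illustrative sketch (the `chunked_sum` helper and the random array are made up for this demo, not part of Keras): summing the same values in differently sized chunks, as batched evaluation effectively does, can change the result in the last decimal places.

```python
import numpy as np

# Classic non-associativity example with Python floats:
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6

# The same effect with batch-wise reduction of float32 values:
rng = np.random.default_rng(0)
values = rng.uniform(-10, 10, size=1000).astype(np.float32)

def chunked_sum(arr, chunk):
    """Sum `arr` in chunks of `chunk` elements, then combine the partial sums."""
    partials = [arr[i:i + chunk].sum() for i in range(0, len(arr), chunk)]
    return np.float32(sum(partials))

# Different chunk sizes may give very slightly different totals,
# analogous to evaluate() with different batch_size values:
print(chunked_sum(values, 10))
print(chunked_sum(values, 50))
```

The differences are tiny (on the order of float32 rounding error), which matches the observation that the evaluated losses are "very similar but nevertheless different".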