Tags: keras, scikit-learn, neural-network, regression, mlp

Different loss values and accuracies of MLP regressor in keras and scikit-learn


I have a neural network with one hidden layer implemented in both Keras and scikit-learn to solve a regression problem. In scikit-learn I used the MLPRegressor class with mostly default parameters, and in Keras I have a hidden Dense layer with parameters set to the same defaults as scikit-learn (which uses Adam with the same learning rate and epsilon, and a batch_size of 200). When I train the networks, the scikit-learn model has a loss value that is about half that of the Keras model, and its accuracy (measured as mean absolute error) is also better. Shouldn't the loss values be similar, if not identical, and the accuracies similar as well? Has anyone experienced something like this and been able to make the Keras model more accurate?

Scikit-learn model:

from sklearn.neural_network import MLPRegressor

clf = MLPRegressor(hidden_layer_sizes=(1600,), max_iter=1000, verbose=True, learning_rate_init=.001)

Keras model:

from tensorflow import keras

inputs = keras.Input(shape=(cols,))
x = keras.layers.Dense(1600, activation='relu', kernel_initializer="glorot_uniform",
                       bias_initializer="glorot_uniform",
                       kernel_regularizer=keras.regularizers.L2(.0001))(inputs)
outputs = keras.layers.Dense(1, kernel_initializer="glorot_uniform",
                             bias_initializer="glorot_uniform",
                             kernel_regularizer=keras.regularizers.L2(.0001))(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(epsilon=1e-8, learning_rate=.001), loss="mse")
model.fit(x=X, y=y, epochs=1000, batch_size=200)

Solution

  • It is because the formula for the mean squared error (MSE) loss in scikit-learn differs from that of TensorFlow.

    From the source code of scikit-learn:

    def squared_loss(y_true, y_pred):
        return ((y_true - y_pred) ** 2).mean() / 2
    

    while the MSE from TensorFlow is:

    backend.mean(math_ops.squared_difference(y_pred, y_true), axis=-1)
    

    As you can see, the scikit-learn loss is divided by 2, which is consistent with what you observed:

    the scikit-learn model has a loss value that is about half of keras

    That implies the Keras and scikit-learn models actually achieved similar performance. It also implies that a learning rate of 0.001 in scikit-learn is not equivalent to the same learning rate in TensorFlow: halving the loss also halves its gradients, so the same nominal learning rate produces different update steps.
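
    To make the factor of two concrete, here is a minimal sketch (the toy arrays are invented for illustration, not taken from the question) that evaluates both formulas on the same predictions:

    import numpy as np

    # Toy targets and predictions, purely for illustration.
    y_true = np.array([1.0, 2.0, 3.0, 4.0])
    y_pred = np.array([1.5, 1.5, 2.5, 5.0])

    # scikit-learn's squared_loss: mean squared error divided by 2.
    sklearn_loss = ((y_true - y_pred) ** 2).mean() / 2

    # Keras/TensorFlow "mse": the plain mean squared error.
    keras_loss = ((y_true - y_pred) ** 2).mean()

    print(sklearn_loss, keras_loss)  # prints 0.21875 0.4375 (exactly a 2x ratio)

    If you want the two reported losses to be directly comparable, one workaround (my suggestion, not something the answer prescribes) is to compile the Keras model with a halved loss, e.g. loss=lambda yt, yp: 0.5 * keras.losses.mean_squared_error(yt, yp).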

    Another, subtler but still significant, difference is the formula for the L2 regularization term.

    From the source code of scikit-learn:

    # Add L2 regularization term to loss
    values = 0
    for s in self.coefs_:
        s = s.ravel()
        values += np.dot(s, s)
    loss += (0.5 * self.alpha) * values / n_samples
    

    while that of TensorFlow is loss = l2 * reduce_sum(square(x)).

    Therefore, with the same L2 regularization parameter, the TensorFlow model is regularized more strongly (its penalty is neither halved nor divided by the number of samples), which results in a poorer fit to the training data.
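
    Based on the two formulas, a rough way to match scikit-learn's penalty strength in Keras is to rescale the L2 coefficient by 0.5 / n_samples. Below is a sketch under that assumption; alpha and n_samples are hypothetical placeholder values, not taken from the question:

    from tensorflow import keras

    alpha = 0.0001      # scikit-learn's default MLPRegressor alpha
    n_samples = 10_000  # placeholder training-set size

    # scikit-learn adds (0.5 * alpha / n_samples) * sum(w**2) to the loss,
    # while keras.regularizers.L2(l2) adds l2 * sum(w**2), so:
    matched_l2 = 0.5 * alpha / n_samples

    layer = keras.layers.Dense(
        1600,
        activation="relu",
        kernel_regularizer=keras.regularizers.L2(matched_l2),
    )

    Note that scikit-learn penalizes only the weights (self.coefs_), not the biases, which the question's Keras code already mirrors by regularizing the kernels only.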