tensorflow, keras, deep-learning, mnist

Can I use MSE as the loss function with label encoding in a classification problem?


from keras.datasets import mnist
from keras import models, layers
from keras.utils import to_categorical

# Load the MNIST digits: 60,000 training and 10,000 test images of 28x28 pixels
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# A simple fully-connected network with a 10-way softmax output
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

network.compile(optimizer='rmsprop',
                loss='mean_squared_error',
                metrics=['accuracy'])

# Flatten the images and scale pixel values to [0, 1]
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

# One-hot encode the integer labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

network.fit(train_images, train_labels, epochs=5, batch_size=128)

test_loss, test_acc = network.evaluate(test_images, test_labels, batch_size=128)

print("test_acc: ", test_acc)
Epoch 1/5
60000/60000 [==============================] - 2s 41us/step - loss: 0.2600 - acc: 0.9244
Epoch 2/5
60000/60000 [==============================] - 2s 34us/step - loss: 0.1055 - acc: 0.9679
Epoch 3/5
60000/60000 [==============================] - 2s 33us/step - loss: 0.0688 - acc: 0.9791
Epoch 4/5
60000/60000 [==============================] - 2s 35us/step - loss: 0.0504 - acc: 0.9848
Epoch 5/5
60000/60000 [==============================] - 2s 38us/step - loss: 0.0373 - acc: 0.9889
10000/10000 [==============================] - 0s 18us/step
test_acc:  0.9791

It seems there is no problem in the training process, but I'm not sure how the MSE is calculated. In this case, does Keras (or TensorFlow) automatically convert the label encoding to one-hot encoding when calculating MSE?


Solution

  • You have manually converted your labels to one-hot encoding already via:

    train_labels = to_categorical(train_labels)

    Since your softmax layer contains 10 nodes, I will assume you intended classification over 10 labels, meaning train_labels will look something like:

    [
     [0,0,0,1,0,0,0,0,0,0],...  <--- One of these per training row
    ]
    

    See the documentation on this.
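    For instance, a quick interactive check of what to_categorical produces (a minimal sketch using the same import as in your code):

    from keras.utils import to_categorical

    # The integer label 3 becomes a 10-element one-hot row
    print(to_categorical([3], num_classes=10))
    # [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]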

    The softmax output for that row may look like:

    [0.033,0.45,0.01,0.9,0,0,0.5,0.4,0.3,0.95]

    (Strictly speaking, a real softmax output would sum to 1; this vector is purely illustrative.)
    

    As explained in this handy resource:

    The softmax function will output a probability of class membership for each class label and attempt to best approximate the expected target for a given input.

    For example, if the integer encoded class 1 was expected for one example, the target vector would be:

    [0, 1, 0]

    The softmax output might look as follows, which puts the most weight on class 1 and less weight on the other classes:

    [0.09003057 0.66524096 0.24472847]
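    You can reproduce those numbers yourself with a few lines of NumPy (a minimal sketch; the logits [1.0, 3.0, 2.0] are my own assumption, chosen so the result matches the quoted probabilities):

    import numpy as np

    def softmax(z):
        # Shift by the max for numerical stability, then normalise
        e = np.exp(z - np.max(z))
        return e / e.sum()

    # Hypothetical logits: class 1 has the largest score
    print(softmax(np.array([1.0, 3.0, 2.0])))
    # [0.09003057 0.66524096 0.24472847]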

    And then the mean squared error is calculated on those two sets of data, with the true labels y_true as per the to_categorical output and the predicted labels y_pred being the softmax output from your network.

    From the tensorflow source code on MSE, this works by:

    1. First calculating the difference between y_true and y_pred and squaring the result, i.e. with the two vectors above:
    import tensorflow as tf
    from tensorflow.python.keras import backend as K
    from tensorflow.python.ops import math_ops

    y_true = [0,0,0,1,0,0,0,0,0,0]
    y_pred = [0.033,0.45,0.01,0.9,0,0,0.5,0.4,0.3,0.95]

    # Element-wise (y_pred - y_true)^2
    math_ops.squared_difference(y_pred, y_true)
    
    
    <tf.Tensor: shape=(10,), dtype=float32, numpy=
    array([1.0890000e-03, 2.0249999e-01, 9.9999997e-05, 1.0000004e-02,
           0.0000000e+00, 0.0000000e+00, 2.5000000e-01, 1.6000001e-01,
           9.0000004e-02, 9.0249997e-01], dtype=float32)>
    
    2. And then taking the mean of the result:
    K.mean(math_ops.squared_difference(y_pred, y_true))
    <tf.Tensor: shape=(), dtype=float32, numpy=0.1616189>

    (As a sanity check: the ten squared differences sum to about 1.616189, and dividing by 10 gives 0.1616189.)
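    Equivalently, the public tf.keras.losses.mean_squared_error API performs exactly these two steps (squared difference, then mean over the last axis) in one call, so it reproduces the same number (a quick check, assuming TF 2.x eager execution):

    import tensorflow as tf

    y_true = [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]
    y_pred = [0.033, 0.45, 0.01, 0.9, 0., 0., 0.5, 0.4, 0.3, 0.95]

    # Same computation as above: mean of squared differences over the last axis
    print(tf.keras.losses.mean_squared_error(y_true, y_pred))
    # tf.Tensor(0.1616189, shape=(), dtype=float32)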
    

    This is obviously just for a single example, but multi-dimensional calculations are handled in the same way, as per the simplified example below:

    >>> y_true = [[1,0],[0,1]]
    >>> y_pred = [[0.95,0.03],[0.3,0.8]]
    >>> K.mean(math_ops.squared_difference(y_pred, y_true))
    
    <tf.Tensor: shape=(), dtype=float32, numpy=0.03335>
    

    You can see that the result is a single number every time (here (0.0025 + 0.0009 + 0.09 + 0.04) / 4 = 0.03335), and that's your loss.
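
    So, to answer the question directly: no, Keras (or TensorFlow) does not convert anything behind the scenes. MSE simply compares the raw numbers it is given, which is why your manual to_categorical step is required. A quick illustration of what would happen with a raw integer label instead (a minimal sketch, again assuming TF 2.x eager execution):

    import tensorflow as tf

    # With a raw integer label, MSE just compares numbers; it does NOT
    # one-hot encode the label for you
    print(tf.keras.losses.mean_squared_error([3.0], [0.5]))
    # tf.Tensor(6.25, shape=(), dtype=float32), i.e. (3.0 - 0.5)^2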