Why is MXNet reporting an incorrect validation accuracy?

I am new to MXNet and want to work through a simple example that uses a one-layer network to solve a digit classification problem. My program is as follows:

import math
import numpy as np
import mxnet as mx
import matplotlib.pyplot as plt
import logging
logging.getLogger().setLevel(logging.DEBUG)
#============================================================
with np.load("notMNIST.npz") as data:
    images, labels = data["images"], data["labels"]

# Reshape the images from 28x28 into 784-element 1D arrays and flatten the labels.
images = images.reshape(784, 18720)
labels = labels.reshape(18720)

# Apply one-hot encoding. 
Images = images.T.astype(np.float32) 
Labels = np.zeros((18720, 10)).astype(np.float32) 
Labels[np.arange(18720), labels] = 1

# Segment the data into training, evaluation and testing. 
X_train = Images[0 : 15000] 
y_train = Labels[0 : 15000]

X_eval = Images[15000 : 16000] 
y_eval = Labels[ 1200 :  2200] # IMPORTANT!!!

X_test = Images[16000 : 18720] 
y_test = Labels[16000 : 18720]

train_iter = mx.io.NDArrayIter(X_train, y_train, 100, shuffle=False)
_eval_iter = mx.io.NDArrayIter(X_eval , y_eval , 100, shuffle=False)
#============================================================
# Variables
X = mx.sym.Variable(name='data')

# Neural Network Layers
fully_connected_layer = mx.sym.FullyConnected(data=X, name='fc1', num_hidden=10)

# Outputs
lro = mx.sym.SoftmaxOutput(data=fully_connected_layer, name="softmax")
#============================================================

model = mx.mod.Module(symbol=lro)

model.fit(train_data=train_iter, eval_data=_eval_iter, 
          optimizer='sgd', optimizer_params={
              'learning_rate' : 1e-5, 
              'momentum' : 0.1}, 
          eval_metric="acc",
          num_epoch=500)

After running the program with evaluation labels 15000 to 16000, the final step reports a validation accuracy of 97%, which I think is too high for a one-layer network. I therefore deliberately changed the evaluation labels to 1200 to 2200, and the program still reports an accuracy of around 83~86%. At first I thought this might be a coincidence, so I tried several other evaluation label ranges, but got similar results.

What mistakes have I made in my program?

Solution

TL;DR

You can fix the problem by not one-hot encoding the labels.

Instead of passing the one-hot slices Labels[0:15000], Labels[15000:16000] and Labels[16000:18720], pass the integer class labels labels[0:15000], labels[15000:16000] and labels[16000:18720].
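A minimal sketch of that change, using a small made-up label array in place of the notMNIST data (the six labels and the 4/2 split are assumptions for illustration only). If only the one-hot matrix is still around, np.argmax along axis 1 recovers the integer labels:

```python
import numpy as np

# Hypothetical stand-in for the integer class labels (6 samples, 10 classes).
labels = np.array([3, 0, 9, 1, 3, 7])

# One-hot matrix built the same way as in the question.
Labels = np.zeros((labels.size, 10), dtype=np.float32)
Labels[np.arange(labels.size), labels] = 1

# Slice the integer labels, not the one-hot rows, for the iterators.
y_train = labels[0:4].astype(np.float32)
y_eval = labels[4:6].astype(np.float32)

# argmax undoes the one-hot encoding when only Labels is available.
recovered = np.argmax(Labels, axis=1)

print(y_train.shape, y_eval.shape)  # (4,) (2,)
print(recovered)                    # [3 0 9 1 3 7]
```

The label arrays are now one-dimensional, so their shape no longer matches the (batch, 10) shape of the network output.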

This will decrease your accuracy to a mediocre 0.796000 on the proper evaluation labels, and to 0.095000 on your "random" evaluation labels.

Detailed answer

You get such high accuracy because of a misleading calculation in mxnet.metric.Accuracy. Internally, the Accuracy metric can work in 2 "modes", depending on the shapes of the provided arguments "preds" and "labels":

1) If shapes of "preds" and "labels" don't match, Accuracy interprets each row of "preds" as the probabilities of a sample belonging to each class. The class is identified by the item's index in the row.

For example, if you have preds=[[0.1, 0.9], [0.8, 0.2]] then it means that:

1st example belongs to class 0 with 0.1 probability and to class 1 with 0.9 probability
2nd example belongs to class 0 with 0.8 probability and to class 1 with 0.2 probability

Working in this mode, "labels" are expected to be an array of real classes. In our case, imagining that the model is absolutely correct, the "labels" array should have been [1, 0].

2) If shapes of "preds" and "labels" do match, then Accuracy treats both arrays as predicted classes and real classes. Each item is treated as the class of one sample, and the calculation is a comparison of the items in "preds" and "labels" that have the same indices.
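The two modes can be emulated in a few lines of NumPy. This is only a sketch of the behaviour described above, not MXNet's own code; the function name emulate_accuracy is made up for illustration:

```python
import numpy as np

def emulate_accuracy(preds, labels):
    """NumPy emulation of the two modes described above (not MXNet's own code)."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    if preds.shape != labels.shape:
        # Mode 1: rows are class probabilities, so take the most likely class.
        preds = np.argmax(preds, axis=1)
    # Compare the items that share the same index (all of mode 2, end of mode 1).
    hits = preds.astype("int32").ravel() == labels.astype("int32").ravel()
    return float(hits.mean())

# Mode 1: shapes (2, 2) and (2,) differ, so argmax is applied first.
print(emulate_accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 0]))  # 1.0

# Mode 2: shapes (4,) and (4,) match, so items are compared directly.
print(emulate_accuracy([1, 0, 1, 1], [1, 1, 1, 1]))        # 0.75
```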

When you apply one-hot encoding to the labels, the second mode of calculation is used, because the shape of the model's predictions matches the shape of the one-hot encoded labels. Accuracy then interprets each item in the arrays as a standalone sample and compares them item by item.

Internally, Accuracy converts the float arrays to int, which always produces 0 for floats less than 1. That behavior essentially converts all predictions to 0, except in the rare case where a class has a probability of exactly 1.0. So in the majority of cases we get preds = [0, 0, ..., 0].

A one-hot encoded array has all items equal to 0 except one, meaning we would have something like [0, 1, 0, ..., 0].

When Accuracy compares these two arrays, it finds that they are equal everywhere except in one place, and so it returns a wrongly high accuracy.
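The truncation step is easy to check in plain NumPy. The sketch below only mimics the cast-and-compare behaviour described above; the probability values are made up for illustration:

```python
import numpy as np

# One sample, 10 classes: softmax-like probabilities, all below 1.0.
preds = np.array([[0.01, 0.02, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0.89, 0.01]])
one_hot = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=np.float32)

# Shapes match, so there is no argmax: both arrays are cast to int.
pred_int = preds.astype("int32")     # every probability below 1 truncates to 0
label_int = one_hot.astype("int32")  # stays [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Item-by-item comparison over all 10 entries.
matches = int((pred_int.ravel() == label_int.ravel()).sum())
accuracy = matches / pred_int.size

print(pred_int.ravel().tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(matches, accuracy)          # 9 0.9
```

The nine zeros in the label line up with nine truncated predictions, so the score is 0.9 no matter which class the model favoured.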

Here is a simple reproducing example:

import mxnet as mx
predicts = mx.nd.array([[1.29206967e-09,   3.40120096e-05,   2.23299547e-12,   3.98692492e-07,
    1.21151755e-10,   2.59370694e-08,   1.95488334e-02,   1.13474562e-05,
    9.80405331e-01,   3.51648767e-12]])
labels = mx.nd.array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
acc = mx.metric.Accuracy()
acc.update(preds=predicts, labels=labels)
print(acc.get())

This will give us

('accuracy', 0.90000000000000002)

because the one-hot encoded label contains exactly 1 non-zero element, so 9 of the 10 compared items are equal.
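Under this item-by-item comparison, a single 10-class sample can only score 0.8, 0.9 or 1.0. The NumPy sketch below emulates the three cases. It assumes a prediction can saturate to exactly 1.0, which depends on the model and the input scaling; the helper names are made up for illustration:

```python
import numpy as np

def elementwise_accuracy(preds, one_hot):
    """Emulate the matching-shape mode: cast to int, compare item by item."""
    p = np.asarray(preds).astype("int32").ravel()
    t = np.asarray(one_hot).astype("int32").ravel()
    return float((p == t).mean())

def one_hot_row(index, num_classes=10):
    """Build a (1, num_classes) row with a single 1.0 at the given index."""
    row = np.zeros((1, num_classes), dtype=np.float32)
    row[0, index] = 1.0
    return row

label = one_hot_row(8)

unsaturated = np.full((1, 10), 0.1)  # no probability reaches 1.0
saturated_right = one_hot_row(8)     # exactly 1.0 on the true class
saturated_wrong = one_hot_row(2)     # exactly 1.0 on a wrong class

print(elementwise_accuracy(unsaturated, label))      # 0.9
print(elementwise_accuracy(saturated_right, label))  # 1.0
print(elementwise_accuracy(saturated_wrong, label))  # 0.8
```

A mix of these three outcomes over a validation set would give an overall figure somewhere between 0.8 and 1.0, even when the labels are unrelated to the images.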