While training a neural network in TensorFlow 2.0 in Python, I'm noticing that training accuracy and loss change dramatically between epochs. I'm aware that the printed metrics are an average over the entire epoch, but accuracy seems to drop significantly after each epoch, despite the average always increasing.
The loss exhibits this behavior as well, dropping significantly at each epoch even though the average increases. Here is an image of what I mean (from TensorBoard):
I've noticed this behavior on all of the models I've implemented myself, so it could be a bug, but I want a second opinion on whether this is normal behavior and, if so, what it means.
Also, I'm using a fairly large dataset (roughly 3 million examples). Batch size is 32, and each dot in the accuracy/loss graphs represents 50 batches (2k on the graph = 100k batches). The learning-rate graph is 1:1 with batches.
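For reference, here is a stripped-down sketch of this kind of setup. The model, synthetic data, and log directory are placeholders rather than my actual code; the per-batch logging frequency is assumed to come from the `update_freq` argument of `tf.keras.callbacks.TensorBoard`:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for the real (~3M example) dataset and model.
x = np.random.rand(10_000, 20).astype("float32")
y = (x.sum(axis=1, keepdims=True) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# update_freq=50 writes the training metrics to TensorBoard every 50 batches,
# matching "each dot represents 50 batches".
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs", update_freq=50)

model.fit(x, y, batch_size=32, epochs=5, callbacks=[tensorboard_cb])
```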
It seems this phenomenon comes from the fact that the model has high batch-to-batch variance in accuracy and loss. This becomes clear if I graph the actual metrics per step rather than the running average over the epoch:
Here you can see that the model varies widely from batch to batch. (This graph covers just one epoch, but the point stands.)
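One way to obtain raw per-step values like this is a custom training loop that writes each batch's loss and accuracy to TensorBoard directly, rather than relying on Keras' running averages. A minimal sketch, again with placeholder model and data:

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model standing in for the real ones.
x = np.random.rand(10_000, 20).astype("float32")
y = (x.sum(axis=1, keepdims=True) > 10).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()
writer = tf.summary.create_file_writer("logs/per_batch")

for step, (x_batch, y_batch) in enumerate(dataset):
    with tf.GradientTape() as tape:
        preds = model(x_batch, training=True)
        loss = loss_fn(y_batch, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Per-batch accuracy computed directly, with no averaging across batches.
    batch_acc = tf.reduce_mean(
        tf.cast(tf.equal(y_batch, tf.cast(preds > 0.5, y_batch.dtype)), tf.float32))

    # Write the raw per-step values instead of a running epoch average.
    with writer.as_default():
        tf.summary.scalar("batch_loss", loss, step=step)
        tf.summary.scalar("batch_accuracy", batch_acc, step=step)
```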
Since the average metrics are reported per epoch, the running average resets at the start of each epoch. With this much batch-to-batch variance, the first few batches of the new epoch are highly likely to pull the fresh average below the previous epoch's final value, producing a dramatic drop in the reported metric, illustrated in red below:
If you imagine the discontinuities in the red graph as epoch transitions, you can see why you would observe the phenomenon described in the question.
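Here is a quick numbers-only illustration of the effect (no real model; the per-batch accuracies are simulated as a slow upward trend plus large batch-to-batch noise, and the reported metric is a running mean that resets at every epoch boundary):

```python
import numpy as np

rng = np.random.default_rng(0)
batches_per_epoch, epochs = 200, 3

prev_final_avg = None
for epoch in range(epochs):
    # Simulated noisy per-batch accuracy with a slight upward trend.
    trend = 0.55 + 0.02 * epoch + 0.02 * np.linspace(0.0, 1.0, batches_per_epoch)
    per_batch_acc = np.clip(trend + rng.normal(0.0, 0.2, batches_per_epoch), 0.0, 1.0)

    # Keras-style running mean since the start of the epoch.
    running_mean = np.cumsum(per_batch_acc) / np.arange(1, batches_per_epoch + 1)

    if prev_final_avg is not None:
        # The first reported value of a new epoch reflects only a single noisy batch,
        # so it can easily sit far below the previous epoch's smooth final average,
        # which shows up as a sharp "drop" at the epoch boundary.
        print(f"epoch {epoch - 1} final avg: {prev_final_avg:.3f}  ->  "
              f"epoch {epoch} first reported value: {running_mean[0]:.3f}")
    prev_final_avg = running_mean[-1]
```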
TL;DR The model has very high variance in its output with respect to each batch.