Tags: tensorflow, deep-learning, jupyter-notebook, loss-function

Sudden increase and stall in loss function


I am currently working on a Deep Learning project that involves training a U-net to perform image registration. The goal of the net is to deform one image (which I call the "moving image") so that its features align with those of another (the fixed image).

I've run the code several times, and the loss suddenly jumps from one epoch to the next before stalling. This happens at a random epoch, sometimes epoch 24, sometimes 60, etc.

I'm running the net on an RTX 3050, in VS Code via WSL. The code is here: https://github.com/jacopoaltieri/Anatomical_landmarks_eval_CNN/blob/main/unet_jupyter_trials.ipynb

How can I fix this issue? And, most importantly, what causes it in the first place?

I will provide some more images of what happens below:

https://imgur.com/a/0krv9vA


Solution

  • I managed to make it work by adding a callback that reduces the learning rate on plateau (Keras's ReduceLROnPlateau). For completeness, switching to the AMSGrad variant of the Adam optimizer should also work, but it achieves worse results; a minimal sketch of both fixes follows below.

    The problem was caused by how Adam handles increasingly small denominators, as suggested here: when the running second-moment estimate becomes very small, the effective step size blows up, which can produce exactly this kind of sudden loss spike.
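
    For reference, a minimal sketch of both fixes in Keras/TensorFlow. The model, the training tensors, and all hyperparameter values (factor, patience, learning rates) are illustrative placeholders, not the ones from my notebook:

        import tensorflow as tf

        # Fix 1: lower the learning rate whenever the validation loss plateaus.
        reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss",
            factor=0.5,    # halve the learning rate at each plateau
            patience=5,    # epochs without improvement before reducing
            min_lr=1e-6,
        )

        # Fix 2 (alternative): the AMSGrad variant of Adam keeps the running
        # maximum of the second-moment estimate, so the denominator in the
        # update can never shrink back toward zero.
        optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, amsgrad=True)

        # model, x_train, y_train, x_val, y_val are placeholders for the
        # U-net and registration data defined elsewhere in the notebook.
        model.compile(optimizer=optimizer, loss="mse")
        model.fit(x_train, y_train,
                  validation_data=(x_val, y_val),
                  epochs=100,
                  callbacks=[reduce_lr])

    In practice, either fix alone was enough to stop the spikes; the ReduceLROnPlateau callback gave the better final results for me.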