I am running a 3D image segmentation deep learning training pipeline on a GCloud VM and am noticing a stepwise decrease in GPU utilization after about 25 epochs, followed by an out-of-memory error after 32 epochs. Since such a training pipeline is essentially the same loop over the data repeated again and again, and since none of the other main metrics show such a pattern change, I don't understand why the first epochs are fine and the problem then suddenly appears.
Could this be some kind of memory leak on the GPU? Could GCloud apply some kind of throttling based on the GPU temperature?
Some context info:
Some things I've tried:
I can reproduce the behaviour reliably; it happens every time after the same number of epochs
If I decrease the input image size, it still happens, but later (epoch 60)
If I decrease the model size, it happens earlier (this is the part I especially don't understand)
I've set `JULIA_CUDA_MEMORY_POOL` to `none` and added a callback after each epoch that executes `GC.gc(true)` and `CUDA.reclaim()` (see the sketch below)
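
For reference, a minimal sketch of what that attempt looked like; `train_one_epoch!`, `model`, `opt`, `train_loader`, and `n_epochs` are hypothetical placeholders for my actual pipeline:

```julia
# Disable CUDA.jl's caching memory pool; this must be set before CUDA.jl
# initialises the GPU (or export JULIA_CUDA_MEMORY_POOL=none in the shell).
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA

for epoch in 1:n_epochs
    train_one_epoch!(model, opt, train_loader)   # hypothetical training step

    # Epoch-end callback: force a full Julia GC sweep, then hand any freed
    # GPU buffers back to the CUDA driver.
    GC.gc(true)
    CUDA.reclaim()
end
```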
The problem was resolved by changing my optimiser from `Flux.Nesterov` to `Optimisers.Nesterov`, as suggested here. Apparently the Flux optimisers accumulate some per-parameter state internally, whereas the ones from Optimisers.jl keep their state in an explicit object that you manage yourself.
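
For anyone hitting the same issue, here is a rough sketch of the change, with `model`, `loss`, `x`, and `y` as hypothetical placeholders; the exact training loop will differ:

```julia
using Flux, Optimisers

# Before: old implicit-style optimiser from Flux.
# opt = Flux.Nesterov(0.001)
# Flux.train!(loss, Flux.params(model), data, opt)

# After: explicit optimiser state from Optimisers.jl.
opt_state = Optimisers.setup(Optimisers.Nesterov(0.001), model)

# Inside the training loop, for each batch (x, y):
grads = Flux.gradient(m -> loss(m, x, y), model)[1]
opt_state, model = Optimisers.update!(opt_state, model, grads)
```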