neural-network, conv-neural-network, image-segmentation, training-data, gradient-descent

What are the differences between 1) training a CNN on the whole training set and 2) training on a subset of the training set and then on the whole training set?


I am training a U-Net segmentation network on the LIDC-IDRI dataset. I am currently comparing two training strategies:

  1. Train the model on the whole training set from scratch (40k steps, 180k steps).
  2. Train the model on 10% of the whole training set; after convergence (30k steps), continue training the model on the whole training set (10k steps). A sketch of this two-phase schedule is shown after the list.
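For concreteness, here is a minimal Keras-style sketch of Strategy 2. It assumes `model` is an already-compiled U-Net and that `subset_ds`/`full_ds` are repeating `tf.data` pipelines for the 10% subset and the whole training set; all of these names (and the function itself) are illustrative, not the asker's actual code:

```python
import tensorflow as tf

def train_two_phase(model: tf.keras.Model,
                    subset_ds: tf.data.Dataset,
                    full_ds: tf.data.Dataset,
                    subset_steps: int = 30_000,
                    full_steps: int = 10_000) -> tf.keras.Model:
    # Phase 1: train on the 10% subset until it (roughly) converges.
    model.fit(subset_ds, steps_per_epoch=subset_steps, epochs=1)
    # Phase 2: keep the learned weights and continue on the full set.
    model.fit(full_ds, steps_per_epoch=full_steps, epochs=1)
    return model
```

The key point of Strategy 2 is that phase 2 starts from the phase-1 weights rather than from a fresh initialization.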

With the Dice coefficient as the loss function, as also used in the V-Net architecture (paper), the model trained with Strategy 2 is consistently better than the one trained with Strategy 1: it achieves a Dice score of 0.735, while Strategy 1 only reaches 0.71.
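In case it helps to make the loss concrete, a common soft-Dice loss looks roughly like this. Note the V-Net paper squares the terms in the denominator; this is the plain variant, and `eps` is an assumed smoothing constant, not something from the question:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    # Soft Dice loss: 1 - Dice coefficient, computed over the whole batch.
    y_true = tf.cast(y_true, tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    # eps keeps the ratio well-defined when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)
```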

By the way, my U-Net is implemented in TensorFlow, and the model is trained on an NVIDIA GTX 1080 Ti.

Could anyone give some explanation or references? Thanks!


Solution

  • Well, I read your question and decided to try it, as it was fairly easy to test; I have also been training V-Nets on LIDC-IDRI, and I usually train on the whole dataset from the beginning. Option 2 gave a faster boost in Dice at first, but the validation Dice soon fell to 2%, and even after letting the network train on the whole dataset it did not recover; the training Dice, of course, kept increasing. It seems my 10% subset was not representative, and the model badly overfit to it.
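To reproduce this diagnosis, tracking the Dice coefficient on a held-out validation set during both phases makes the divergence visible. A small sketch, reusing the illustrative `dice_loss`, `model`, `subset_ds`, and a hypothetical `val_ds` from above:

```python
import tensorflow as tf

def dice_coef(y_true, y_pred, eps=1e-6):
    # Dice coefficient as a metric (higher is better), mirroring the loss.
    y_true = tf.cast(y_true, tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + eps) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)

# model.compile(optimizer="adam", loss=dice_loss, metrics=[dice_coef])
# model.fit(subset_ds, validation_data=val_ds, steps_per_epoch=1000, epochs=30)
# A training Dice that keeps rising while the validation Dice collapses,
# as reported above, is the classic signature of overfitting the subset.
```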