I am quite new to the field of semantic segmentation and have recently tried to run the code provided with this paper, Transfer Learning for Brain Tumor Segmentation, which was made available on GitHub. It is a semantic segmentation task that uses the BraTS2020 dataset, comprising 4 modalities: T1, T1ce, T2 and FLAIR. The author utilised a transfer learning approach using ResNet34 weights.
Due to hardware constraints, I had to halve the batch size from 24 to 12. However, after training the model, I noticed a significant drop in performance: the Dice scores (higher is better) for the 3 classes are only around 5, 19 and 11, as opposed to the 78, 87 and 82 reported in the paper. The training and validation accuracies, however, look normal; the model just does not perform well on the test data. I also selected the model that was produced before overfitting set in (the point where validation loss starts increasing while training loss is still decreasing), but it yielded equally bad results.
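For reference, this is the Dice score as I understand it, written as a minimal NumPy sketch for a single binary class mask. It is not the evaluation code from the repository; the function name `dice_score` and the toy masks are my own, and treating two empty masks as a perfect score is just one common convention. The numbers quoted above are this value multiplied by 100.

```python
import numpy as np


def dice_score(pred, target):
    """Dice = 2 * |P & T| / (|P| + |T|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denominator = pred.sum() + target.sum()
    if denominator == 0:
        # Both masks are empty: treated as perfect agreement (a convention).
        return 1.0
    return float(2.0 * intersection / denominator)


# Toy example: 3 predicted voxels, 3 true voxels, 2 of them overlapping.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
dice = dice_score(pred, target)
print(f"Dice: {dice:.3f}")
```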
So far I have tried:
I noticed that image augmentations were applied to the training and validation datasets to increase the robustness of the model training. Do these augmentations need to be performed on the test set in order to make predictions? There are no resizing transforms; the transforms that are present are Gaussian blur and noise, changes in brightness intensity, rotations, elastic deformation, and mirroring, all implemented using the example here.
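To make the augmentation question concrete, here is a small sketch of the two input paths I am comparing. Everything in it is a hypothetical stand-in: `augment_for_training` only mimics mirroring and Gaussian noise with plain NumPy, `prepare_for_test` is the "no augmentation at test time" option I am asking about, and neither function comes from the repository.

```python
import numpy as np


def augment_for_training(image, rng):
    """Random augmentations (simplified stand-ins for the ones listed above)."""
    if rng.random() < 0.5:
        image = np.flip(image, axis=-1)  # random mirroring
    # Additive Gaussian noise; the standard deviation 0.1 is an arbitrary choice.
    return image + rng.normal(0.0, 0.1, size=image.shape)


def prepare_for_test(image):
    """Option in question: feed the test image without random augmentation."""
    return image.copy()


rng = np.random.default_rng(0)
image = rng.random((4, 8, 8))  # 4 modalities, toy 8x8 slice

# The training path gives a different input every time it is called...
train_a = augment_for_training(image, rng)
train_b = augment_for_training(image, rng)

# ...while the test path is deterministic.
test_a = prepare_for_test(image)
test_b = prepare_for_test(image)

print(np.allclose(train_a, train_b), np.array_equal(test_a, test_b))
```

Since none of the transforms change the image size, both paths produce inputs of the same shape, so the model can run on either.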
I'd greatly appreciate help on these questions:
Since the batch size is halved, doubling the number of batches per epoch means each epoch covers the same number of training samples as in the original paper (at twice the number of iterations). Is this the correct approach?
Does the test set data need to be augmented similarly to the training data in order to perform predictions? (Note: no resizing transformations were performed)
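The arithmetic behind the first question can be sketched as follows. The batch sizes 24 and 12 are the real ones; the value of 100 batches per epoch is a hypothetical placeholder, not the setting used in the repository.

```python
def samples_per_epoch(batch_size, batches_per_epoch):
    """Number of training samples the model sees in one epoch."""
    return batch_size * batches_per_epoch


ORIGINAL_BATCH_SIZE = 24
MY_BATCH_SIZE = 12

original_batches_per_epoch = 100  # hypothetical placeholder
my_batches_per_epoch = 2 * original_batches_per_epoch

original_samples = samples_per_epoch(ORIGINAL_BATCH_SIZE, original_batches_per_epoch)
my_samples = samples_per_epoch(MY_BATCH_SIZE, my_batches_per_epoch)

# Samples seen per epoch are equal in both setups.
print(original_samples, my_samples)
# Weight updates per epoch are doubled in my setup.
print(original_batches_per_epoch, my_batches_per_epoch)
```

So the doubling keeps the amount of data per epoch the same, while the number of weight updates per epoch doubles and each update is computed from half as many samples.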