
Hello World (MNIST) with a feed-forward network gets lower accuracy with DistributedDataParallel (DDP) than plain training, even on a single node


This is a cross-post to my question in the Pytorch forum.

When using PyTorch's DistributedDataParallel (DDP) on only one node, I expect the results to be the same as those of the same script without DistributedDataParallel.

I created a simple MNIST training setup with a three-layer feed-forward neural network. When trained with the same hyperparameters, the same number of epochs, and essentially the same code apart from the use of the DDP library, it gives significantly lower accuracy (around 10%).
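The repository itself is not reproduced here, but a three-layer feed-forward MNIST model of the kind described above might look like the following sketch (the hidden-layer sizes are assumptions, not taken from the repository):

```python
import torch
import torch.nn as nn

# Hypothetical three-layer feed-forward network for 28x28 MNIST images.
# The hidden sizes (256, 128) are illustrative assumptions.
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),             # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(28 * 28, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10),       # 10 MNIST classes
        )

    def forward(self, x):
        return self.net(x)

model = FeedForward()
logits = model(torch.randn(32, 1, 28, 28))  # one dummy batch of 32 images
```

In the DDP variant, the same module would simply be wrapped in `torch.nn.parallel.DistributedDataParallel` after the process group is initialized; the model definition itself stays unchanged.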

I created a GitHub repository demonstrating my problem.

I hope this is a usage error on my part, but I do not see where the problem could be, and colleagues of mine have already audited the code. I also tried it on macOS with a CPU and on three different GPU/Ubuntu combinations (one with a 1080 Ti, one with a 2080 Ti, and a cluster with P100s), all giving the same results. Seeds are fixed for reproducibility.
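Fixing seeds in PyTorch usually means seeding every RNG that feeds the training run, not just `torch.manual_seed`. The exact calls used in the repository are not shown here; a common pattern looks like this:

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed the Python, NumPy, and PyTorch RNGs; torch.manual_seed also
    # seeds the CUDA generators in recent PyTorch versions.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(0)
a = torch.randn(4)
set_seed(0)
b = torch.randn(4)  # same seed, so identical to `a`
```

Note that identical seeds across the plain and DDP scripts still do not guarantee identical runs if the two scripts draw different numbers of samples per step, which is exactly the batch-size issue discussed in the answer below.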


Solution

  • You are using different batch sizes in your two experiments: batch_size=128 in mnist-distributed.py and batch_size=32 in mnist-plain.py. With different batch sizes, the two trainings cannot be expected to reach the same accuracy.
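More generally, in DDP the batch size is per process: each of the N workers draws its own batch, and one synchronized optimizer step averages gradients over N × batch_size samples. To make a DDP run comparable to a plain run, the per-process batch size should be the plain batch size divided by the world size. A small helper illustrating the arithmetic (hypothetical, not from the repository):

```python
def per_process_batch_size(global_batch_size: int, world_size: int) -> int:
    """Batch size each DDP process should use so that one synchronized
    optimizer step covers `global_batch_size` samples in total."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size

# A plain run with batch_size=128 matches a 4-process DDP run using 32 each;
# on a single process, the per-process and global batch sizes coincide.
print(per_process_batch_size(128, 4))
print(per_process_batch_size(128, 1))
```

With only one process, as in the question's single-node test, the per-process batch size is the effective batch size, so the two scripts should simply use the same value.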