tensorflow · keras · multi-gpu

Multi-GPU training does not reduce training time


I trained three UNet models for image segmentation with Keras to assess the effect of multi-GPU training.

  1. The first model was trained with a batch size of 1 on a single GPU (P100). Each training step took ~254 ms. (Note that this is per step, not per epoch.)
  2. The second model was trained with a batch size of 2 on a single GPU (P100). Each training step took ~399 ms.
  3. The third model was trained with a batch size of 2 on two GPUs (P100). Each training step took ~370 ms. Logically it should have taken about the same time as the first case, since each GPU processes one sample in parallel, but it actually took longer.

Can anyone tell me whether multi-GPU training actually reduces training time? For reference, all models were trained with Keras.
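
For context, a minimal sketch of how the two-GPU case (global batch size 2 split across two P100s) might be set up with `tf.distribute.MirroredStrategy`. The question does not include the actual training code, so the tiny model and random data below are placeholders standing in for the real UNet and dataset:

```python
import tensorflow as tf

def build_placeholder_model(input_shape=(128, 128, 3)):
    # Stand-in for the real UNet architecture, which is not shown in the question.
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
    outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

# Case 3: two GPUs. MirroredStrategy splits the global batch of 2,
# so each GPU sees exactly one sample per training step.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = build_placeholder_model()
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy data just to make the sketch runnable end to end.
x = tf.random.uniform((64, 128, 128, 3))
y = tf.cast(tf.random.uniform((64, 128, 128, 1)) > 0.5, tf.float32)
model.fit(x, y, batch_size=2, epochs=1)
```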


Solution

  • I presume this is because you are using a very small batch size. In that case, the cost of distributing the gradients/computations across two GPUs and gathering them back (plus the CPU-to-GPU data transfer for both devices) outweighs the parallel speed-up you might gain over sequential training on a single GPU.

    Expect to see a bigger difference with a batch size of, for instance, 8 or 16.
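
    A minimal sketch of that suggestion, assuming TensorFlow 2's `tf.distribute.MirroredStrategy`: keep the per-GPU batch fixed (8 here, purely illustrative) and scale the global batch with the number of replicas, so each device has enough work per step to amortise the gradient-synchronisation overhead. The model and data are again placeholders:

    ```python
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    n_replicas = strategy.num_replicas_in_sync

    # Fix the per-GPU batch (illustrative value) and scale the global batch
    # with the number of GPUs, so each device stays busy per step.
    per_gpu_batch = 8
    global_batch_size = per_gpu_batch * n_replicas

    with strategy.scope():
        # Placeholder model standing in for the UNet.
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                                   input_shape=(128, 128, 3)),
            tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")

    # Dummy data to make the sketch runnable.
    x = tf.random.uniform((256, 128, 128, 3))
    y = tf.cast(tf.random.uniform((256, 128, 128, 1)) > 0.5, tf.float32)
    model.fit(x, y, batch_size=global_batch_size, epochs=1)
    ```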