I am training an autoencoder network on TensorFlow-GPU 1.13.1. Initially, I used a batch size of 32/64/128, but the GPU does not seem to be used at all, even though "Memory-Usage" from nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 34C P0 53W / 300W | 31316MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
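Note that high Memory-Usage alone does not mean the GPU is doing work: TensorFlow 1.x reserves nearly all GPU memory at startup by default, so the column to watch is GPU-Util, which sits at 0% here. As a first check, a minimal sketch like the following confirms that TensorFlow actually sees the device and places ops on it (assuming standalone TF 1.x; this is a diagnostic, not part of the training script):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; a GPU entry should appear here.
print(device_lib.list_local_devices())

# Run a trivial op with device placement logging to confirm it lands on the GPU.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
    print(sess.run(tf.matmul(a, b)))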
Also, the training stops at the 39th step every time.
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) (None, 256, 256, 3) 0
_________________________________________________________________
conv2d_6 (Conv2D) (None, 64, 64, 96) 34944
_________________________________________________________________
batch_normalization_6 (Batch (None, 64, 64, 96) 384
_________________________________________________________________
activation_6 (Activation) (None, 64, 64, 96) 0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 31, 31, 96) 0
_________________________________________________________________
conv2d_7 (Conv2D) (None, 31, 31, 256) 614656
_________________________________________________________________
batch_normalization_7 (Batch (None, 31, 31, 256) 1024
_________________________________________________________________
activation_7 (Activation) (None, 31, 31, 256) 0
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 15, 15, 256) 0
_________________________________________________________________
conv2d_8 (Conv2D) (None, 15, 15, 384) 885120
_________________________________________________________________
batch_normalization_8 (Batch (None, 15, 15, 384) 1536
_________________________________________________________________
activation_8 (Activation) (None, 15, 15, 384) 0
_________________________________________________________________
conv2d_9 (Conv2D) (None, 15, 15, 384) 1327488
_________________________________________________________________
batch_normalization_9 (Batch (None, 15, 15, 384) 1536
_________________________________________________________________
activation_9 (Activation) (None, 15, 15, 384) 0
_________________________________________________________________
conv2d_10 (Conv2D) (None, 15, 15, 256) 884992
_________________________________________________________________
batch_normalization_10 (Batc (None, 15, 15, 256) 1024
_________________________________________________________________
activation_10 (Activation) (None, 15, 15, 256) 0
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 7, 7, 256) 0
_________________________________________________________________
conv2d_11 (Conv2D) (None, 1, 1, 1024) 12846080
_________________________________________________________________
batch_normalization_11 (Batc (None, 1, 1, 1024) 4096
_________________________________________________________________
encoded (Activation) (None, 1, 1, 1024) 0
_________________________________________________________________
reshape_1 (Reshape) (None, 2, 2, 256) 0
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 4, 4, 128) 819328
_________________________________________________________________
activation_11 (Activation) (None, 4, 4, 128) 0
_________________________________________________________________
conv2d_transpose_2 (Conv2DTr (None, 8, 8, 64) 204864
_________________________________________________________________
activation_12 (Activation) (None, 8, 8, 64) 0
_________________________________________________________________
conv2d_transpose_3 (Conv2DTr (None, 16, 16, 32) 51232
_________________________________________________________________
activation_13 (Activation) (None, 16, 16, 32) 0
_________________________________________________________________
conv2d_transpose_4 (Conv2DTr (None, 32, 32, 16) 12816
_________________________________________________________________
activation_14 (Activation) (None, 32, 32, 16) 0
_________________________________________________________________
conv2d_transpose_5 (Conv2DTr (None, 64, 64, 8) 3208
_________________________________________________________________
activation_15 (Activation) (None, 64, 64, 8) 0
_________________________________________________________________
conv2d_transpose_6 (Conv2DTr (None, 128, 128, 4) 804
_________________________________________________________________
activation_16 (Activation) (None, 128, 128, 4) 0
_________________________________________________________________
conv2d_transpose_7 (Conv2DTr (None, 256, 256, 3) 303
=================================================================
Total params: 17,695,435
Trainable params: 17,690,635
Non-trainable params: 4,800
_________________________________________________________________
Epoch 1/1
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
1/1382 [..............................] - ETA: 19:43:47 - loss: 0.6934 - accuracy: 0.1511
2/1382 [..............................] - ETA: 10:04:16 - loss: 0.6933 - accuracy: 0.1545
3/1382 [..............................] - ETA: 7:28:21 - loss: 0.6933 - accuracy: 0.1571
4/1382 [..............................] - ETA: 6:07:30 - loss: 0.6932 - accuracy: 0.1590
5/1382 [..............................] - ETA: 5:21:58 - loss: 0.6931 - accuracy: 0.1614
6/1382 [..............................] - ETA: 4:55:45 - loss: 0.6930 - accuracy: 0.1648
7/1382 [..............................] - ETA: 4:32:58 - loss: 0.6929 - accuracy: 0.1668
8/1382 [..............................] - ETA: 4:15:07 - loss: 0.6929 - accuracy: 0.1692
9/1382 [..............................] - ETA: 4:02:22 - loss: 0.6928 - accuracy: 0.1726
10/1382 [..............................] - ETA: 3:50:11 - loss: 0.6926 - accuracy: 0.1745
11/1382 [..............................] - ETA: 3:39:13 - loss: 0.6925 - accuracy: 0.1769
12/1382 [..............................] - ETA: 3:29:38 - loss: 0.6924 - accuracy: 0.1797
13/1382 [..............................] - ETA: 3:21:11 - loss: 0.6923 - accuracy: 0.1824
14/1382 [..............................] - ETA: 3:13:42 - loss: 0.6922 - accuracy: 0.1845
15/1382 [..............................] - ETA: 3:07:17 - loss: 0.6920 - accuracy: 0.1871
16/1382 [..............................] - ETA: 3:01:59 - loss: 0.6919 - accuracy: 0.1896
17/1382 [..............................] - ETA: 2:57:36 - loss: 0.6918 - accuracy: 0.1916
18/1382 [..............................] - ETA: 2:53:06 - loss: 0.6917 - accuracy: 0.1938
19/1382 [..............................] - ETA: 2:49:37 - loss: 0.6915 - accuracy: 0.1956
20/1382 [..............................] - ETA: 2:45:51 - loss: 0.6915 - accuracy: 0.1979
21/1382 [..............................] - ETA: 2:43:18 - loss: 0.6914 - accuracy: 0.2000
22/1382 [..............................] - ETA: 2:41:02 - loss: 0.6913 - accuracy: 0.2022
23/1382 [..............................] - ETA: 2:39:23 - loss: 0.6912 - accuracy: 0.2039
24/1382 [..............................] - ETA: 2:37:23 - loss: 0.6911 - accuracy: 0.2060
25/1382 [..............................] - ETA: 2:35:58 - loss: 0.6909 - accuracy: 0.2080
26/1382 [..............................] - ETA: 2:34:06 - loss: 0.6909 - accuracy: 0.2098
27/1382 [..............................] - ETA: 2:33:19 - loss: 0.6908 - accuracy: 0.2115
28/1382 [..............................] - ETA: 2:32:24 - loss: 0.6906 - accuracy: 0.2130
29/1382 [..............................] - ETA: 2:31:43 - loss: 0.6904 - accuracy: 0.2143
30/1382 [..............................] - ETA: 2:31:09 - loss: 0.6904 - accuracy: 0.2157
31/1382 [..............................] - ETA: 2:30:34 - loss: 0.6902 - accuracy: 0.2173
32/1382 [..............................] - ETA: 2:29:26 - loss: 0.6901 - accuracy: 0.2185
33/1382 [..............................] - ETA: 2:28:55 - loss: 0.6900 - accuracy: 0.2199
34/1382 [..............................] - ETA: 2:28:05 - loss: 0.6899 - accuracy: 0.2213
35/1382 [..............................] - ETA: 2:27:23 - loss: 0.6898 - accuracy: 0.2227
36/1382 [..............................] - ETA: 2:27:02 - loss: 0.6897 - accuracy: 0.2238
37/1382 [..............................] - ETA: 2:26:56 - loss: 0.6895 - accuracy: 0.2253
38/1382 [..............................] - ETA: 2:26:32 - loss: 0.6893 - accuracy: 0.2266
39/1382 [..............................] - ETA: 2:26:11 - loss: 0.6891 - accuracy: 0.2278
Even after waiting for hours, the training process does not move any further.
Another unusual thing I noticed is that when I set the batch size to 1, the GPU is continuously utilized.
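Since the hang only appears with larger batch sizes, it helps to separate the model from the input pipeline by timing the data generator on its own, with no GPU involved. Hangs like this are often traced to the input pipeline (for example, generator workers blocked on slow I/O) rather than the model. Below is a minimal sketch assuming a Keras ImageDataGenerator pipeline like the one implied by the "Found ... images belonging to 1 classes" messages; the directory path and batch size are placeholders, and the imports should be swapped for tf.keras if that is what the training script uses:

import time
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)
# Placeholder path; point this at the actual dataset directory.
gen = datagen.flow_from_directory(
    "/path/to/dataset",
    target_size=(256, 256),
    batch_size=32,
    class_mode=None,  # autoencoder: images only, no labels
)

# Pull a number of batches and time each one. If these times grow or the loop
# stalls around the same batch count as training does, the bottleneck is in
# data loading, not in the model or the GPU.
for i in range(50):
    t0 = time.time()
    batch = next(gen)
    print("batch %d: %.2fs, shape %s" % (i, time.time() - t0, batch.shape))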
What could be the problem?
This turned out to be an issue with the drive where the dataset was placed. The code was working fine everywhere except on this server. I moved the dataset from one NFS share to another, and now everything works well.
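For anyone who wants to confirm this kind of I/O bottleneck before moving data around, a quick check is to measure raw read throughput from the share outside of TensorFlow entirely. A rough sketch (the mount path is a placeholder for the dataset location on the suspect share):

import os
import time

root = "/mnt/nfs_share/dataset"  # placeholder for the suspect NFS mount
paths = []
for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        paths.append(os.path.join(dirpath, name))
    if len(paths) >= 200:
        break

# Read a sample of files and report throughput; single-digit MB/s on a share
# that must feed 256x256 image batches would explain a stalled fit_generator.
t0 = time.time()
total = 0
for p in paths[:200]:
    with open(p, "rb") as f:
        total += len(f.read())
elapsed = time.time() - t0
print("read %.1f MB in %.1fs (%.1f MB/s)" % (total / 1e6, elapsed, total / 1e6 / elapsed))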