
Why is TensorFlow GPU not working with larger batch sizes?

I am training an autoencoder network with TensorFlow GPU 1.13.1. Initially I used batch sizes of 32, 64, and 128, but the GPU does not seem to be used at all, even though "Memory-Usage" in the nvidia-smi output shows the following:

| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0    53W / 300W |  31316MiB / 32480MiB |      0%      Default |

The training also stops at step 39 every time.

Model: "model_1"
Layer (type)                 Output Shape              Param #
input_3 (InputLayer)         (None, 256, 256, 3)       0
conv2d_6 (Conv2D)            (None, 64, 64, 96)        34944
batch_normalization_6 (Batch (None, 64, 64, 96)        384
activation_6 (Activation)    (None, 64, 64, 96)        0
max_pooling2d_4 (MaxPooling2 (None, 31, 31, 96)        0
conv2d_7 (Conv2D)            (None, 31, 31, 256)       614656
batch_normalization_7 (Batch (None, 31, 31, 256)       1024
activation_7 (Activation)    (None, 31, 31, 256)       0
max_pooling2d_5 (MaxPooling2 (None, 15, 15, 256)       0
conv2d_8 (Conv2D)            (None, 15, 15, 384)       885120
batch_normalization_8 (Batch (None, 15, 15, 384)       1536
activation_8 (Activation)    (None, 15, 15, 384)       0
conv2d_9 (Conv2D)            (None, 15, 15, 384)       1327488
batch_normalization_9 (Batch (None, 15, 15, 384)       1536
activation_9 (Activation)    (None, 15, 15, 384)       0
conv2d_10 (Conv2D)           (None, 15, 15, 256)       884992
batch_normalization_10 (Batc (None, 15, 15, 256)       1024
activation_10 (Activation)   (None, 15, 15, 256)       0
max_pooling2d_6 (MaxPooling2 (None, 7, 7, 256)         0
conv2d_11 (Conv2D)           (None, 1, 1, 1024)        12846080
batch_normalization_11 (Batc (None, 1, 1, 1024)        4096
encoded (Activation)         (None, 1, 1, 1024)        0
reshape_1 (Reshape)          (None, 2, 2, 256)         0
conv2d_transpose_1 (Conv2DTr (None, 4, 4, 128)         819328
activation_11 (Activation)   (None, 4, 4, 128)         0
conv2d_transpose_2 (Conv2DTr (None, 8, 8, 64)          204864
activation_12 (Activation)   (None, 8, 8, 64)          0
conv2d_transpose_3 (Conv2DTr (None, 16, 16, 32)        51232
activation_13 (Activation)   (None, 16, 16, 32)        0
conv2d_transpose_4 (Conv2DTr (None, 32, 32, 16)        12816
activation_14 (Activation)   (None, 32, 32, 16)        0
conv2d_transpose_5 (Conv2DTr (None, 64, 64, 8)         3208
activation_15 (Activation)   (None, 64, 64, 8)         0
conv2d_transpose_6 (Conv2DTr (None, 128, 128, 4)       804
activation_16 (Activation)   (None, 128, 128, 4)       0
conv2d_transpose_7 (Conv2DTr (None, 256, 256, 3)       303
Total params: 17,695,435
Trainable params: 17,690,635
Non-trainable params: 4,800
Epoch 1/1
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.

   1/1382 [..............................] - ETA: 19:43:47 - loss: 0.6934 - accuracy: 0.1511
   2/1382 [..............................] - ETA: 10:04:16 - loss: 0.6933 - accuracy: 0.1545
   3/1382 [..............................] - ETA: 7:28:21 - loss: 0.6933 - accuracy: 0.1571
   4/1382 [..............................] - ETA: 6:07:30 - loss: 0.6932 - accuracy: 0.1590
   5/1382 [..............................] - ETA: 5:21:58 - loss: 0.6931 - accuracy: 0.1614
   6/1382 [..............................] - ETA: 4:55:45 - loss: 0.6930 - accuracy: 0.1648
   7/1382 [..............................] - ETA: 4:32:58 - loss: 0.6929 - accuracy: 0.1668
   8/1382 [..............................] - ETA: 4:15:07 - loss: 0.6929 - accuracy: 0.1692
   9/1382 [..............................] - ETA: 4:02:22 - loss: 0.6928 - accuracy: 0.1726
  10/1382 [..............................] - ETA: 3:50:11 - loss: 0.6926 - accuracy: 0.1745
  11/1382 [..............................] - ETA: 3:39:13 - loss: 0.6925 - accuracy: 0.1769
  12/1382 [..............................] - ETA: 3:29:38 - loss: 0.6924 - accuracy: 0.1797
  13/1382 [..............................] - ETA: 3:21:11 - loss: 0.6923 - accuracy: 0.1824
  14/1382 [..............................] - ETA: 3:13:42 - loss: 0.6922 - accuracy: 0.1845
  15/1382 [..............................] - ETA: 3:07:17 - loss: 0.6920 - accuracy: 0.1871
  16/1382 [..............................] - ETA: 3:01:59 - loss: 0.6919 - accuracy: 0.1896
  17/1382 [..............................] - ETA: 2:57:36 - loss: 0.6918 - accuracy: 0.1916
  18/1382 [..............................] - ETA: 2:53:06 - loss: 0.6917 - accuracy: 0.1938
  19/1382 [..............................] - ETA: 2:49:37 - loss: 0.6915 - accuracy: 0.1956
  20/1382 [..............................] - ETA: 2:45:51 - loss: 0.6915 - accuracy: 0.1979
  21/1382 [..............................] - ETA: 2:43:18 - loss: 0.6914 - accuracy: 0.2000
  22/1382 [..............................] - ETA: 2:41:02 - loss: 0.6913 - accuracy: 0.2022
  23/1382 [..............................] - ETA: 2:39:23 - loss: 0.6912 - accuracy: 0.2039
  24/1382 [..............................] - ETA: 2:37:23 - loss: 0.6911 - accuracy: 0.2060
  25/1382 [..............................] - ETA: 2:35:58 - loss: 0.6909 - accuracy: 0.2080
  26/1382 [..............................] - ETA: 2:34:06 - loss: 0.6909 - accuracy: 0.2098
  27/1382 [..............................] - ETA: 2:33:19 - loss: 0.6908 - accuracy: 0.2115
  28/1382 [..............................] - ETA: 2:32:24 - loss: 0.6906 - accuracy: 0.2130
  29/1382 [..............................] - ETA: 2:31:43 - loss: 0.6904 - accuracy: 0.2143
  30/1382 [..............................] - ETA: 2:31:09 - loss: 0.6904 - accuracy: 0.2157
  31/1382 [..............................] - ETA: 2:30:34 - loss: 0.6902 - accuracy: 0.2173
  32/1382 [..............................] - ETA: 2:29:26 - loss: 0.6901 - accuracy: 0.2185
  33/1382 [..............................] - ETA: 2:28:55 - loss: 0.6900 - accuracy: 0.2199
  34/1382 [..............................] - ETA: 2:28:05 - loss: 0.6899 - accuracy: 0.2213
  35/1382 [..............................] - ETA: 2:27:23 - loss: 0.6898 - accuracy: 0.2227
  36/1382 [..............................] - ETA: 2:27:02 - loss: 0.6897 - accuracy: 0.2238
  37/1382 [..............................] - ETA: 2:26:56 - loss: 0.6895 - accuracy: 0.2253
  38/1382 [..............................] - ETA: 2:26:32 - loss: 0.6893 - accuracy: 0.2266
  39/1382 [..............................] - ETA: 2:26:11 - loss: 0.6891 - accuracy: 0.2278

Even after waiting for hours, the training does not progress any further.

Another unusual thing I noticed is that with the batch size set to 1, the GPU is utilized continuously.

What could be the problem?


  • This might be an issue with the drive where the dataset is stored. The code worked fine everywhere except on this server. After I moved the dataset from one NFS share to another, everything worked well.
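
One way to check whether the storage is the bottleneck is to time raw file reads from the dataset directory, with TensorFlow out of the picture. The following is only a minimal sketch using the Python standard library. The function name `measure_read_throughput` and its parameters are made up for illustration and are not part of TensorFlow or Keras. It assumes the images sit as ordinary files somewhere under one root directory.

```python
import os
import time


def measure_read_throughput(root, max_files=200, chunk_size=1 << 20):
    """Read up to max_files files under root and time the reads.

    Hypothetical helper for illustration only.
    Returns a tuple (files_read, bytes_read, elapsed_seconds).
    """
    files_read = 0
    bytes_read = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if files_read >= max_files:
                break
            path = os.path.join(dirpath, name)
            # Read in chunks so large files do not have to fit in memory at once.
            with open(path, "rb") as fh:
                while True:
                    chunk = fh.read(chunk_size)
                    if not chunk:
                        break
                    bytes_read += len(chunk)
            files_read += 1
        if files_read >= max_files:
            break
    elapsed = time.perf_counter() - start
    return files_read, bytes_read, elapsed


# Example call (the path is a placeholder, replace it with your dataset root):
# files, nbytes, secs = measure_read_throughput("/path/to/dataset")
# print(f"{files} files, {nbytes / 1e6:.1f} MB in {secs:.2f} s")
```

Running this against the dataset on each drive and comparing the timings should show whether one share is much slower or stalls outright. Keep in mind that the operating system may cache files after the first read, so a second run over the same files can look faster than the storage really is.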