I am training an autoencoder network on TensorFlow-GPU 1.13.1. Initially, I used a batch size of 32/64/128, but the GPU does not seem to be used at all, even though "Memory-Usage" from nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 34C P0 53W / 300W | 31316MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
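Note that high Memory-Usage alone does not mean the GPU is doing work: TensorFlow 1.x reserves nearly all GPU memory at startup by default, so the column to watch is GPU-Util, which sits at 0% here. As a first check, a minimal sketch like the following confirms that TensorFlow actually sees the device and places ops on it (assuming standalone TF 1.x; this is a diagnostic, not part of the training script):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; a GPU entry should appear here.
print(device_lib.list_local_devices())

# Run a trivial op with device placement logging to confirm it lands on the GPU.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
    print(sess.run(tf.matmul(a, b)))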
Also, the training stops at the 39th step every time.
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_3 (InputLayer) (None, 256, 256, 3) 0
_________________________________________________________________
conv2d_6 (Conv2D) (None, 64, 64, 96) 34944
_________________________________________________________________
batch_normalization_6 (Batch (None, 64, 64, 96) 384
_________________________________________________________________
activation_6 (Activation) (None, 64, 64, 96) 0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 31, 31, 96) 0
_________________________________________________________________
conv2d_7 (Conv2D) (None, 31, 31, 256) 614656
_________________________________________________________________
batch_normalization_7 (Batch (None, 31, 31, 256) 1024
_________________________________________________________________
activation_7 (Activation) (None, 31, 31, 256) 0
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 15, 15, 256) 0
_________________________________________________________________
conv2d_8 (Conv2D) (None, 15, 15, 384) 885120
_________________________________________________________________
batch_normalization_8 (Batch (None, 15, 15, 384) 1536
_________________________________________________________________
activation_8 (Activation) (None, 15, 15, 384) 0
_________________________________________________________________
conv2d_9 (Conv2D) (None, 15, 15, 384) 1327488
_________________________________________________________________
batch_normalization_9 (Batch (None, 15, 15, 384) 1536
_________________________________________________________________
activation_9 (Activation) (None, 15, 15, 384) 0
_________________________________________________________________
conv2d_10 (Conv2D) (None, 15, 15, 256) 884992
_________________________________________________________________
batch_normalization_10 (Batc (None, 15, 15, 256) 1024
_________________________________________________________________
activation_10 (Activation) (None, 15, 15, 256) 0
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 7, 7, 256) 0
_________________________________________________________________
conv2d_11 (Conv2D) (None, 1, 1, 1024) 12846080
_________________________________________________________________
batch_normalization_11 (Batc (None, 1, 1, 1024) 4096
_________________________________________________________________
encoded (Activation) (None, 1, 1, 1024) 0
_________________________________________________________________
reshape_1 (Reshape) (None, 2, 2, 256) 0
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 4, 4, 128) 819328
_________________________________________________________________
activation_11 (Activation) (None, 4, 4, 128) 0
_________________________________________________________________
conv2d_transpose_2 (Conv2DTr (None, 8, 8, 64) 204864
_________________________________________________________________
activation_12 (Activation) (None, 8, 8, 64) 0
_________________________________________________________________
conv2d_transpose_3 (Conv2DTr (None, 16, 16, 32) 51232
_________________________________________________________________
activation_13 (Activation) (None, 16, 16, 32) 0
_________________________________________________________________
conv2d_transpose_4 (Conv2DTr (None, 32, 32, 16) 12816
_________________________________________________________________
activation_14 (Activation) (None, 32, 32, 16) 0
_________________________________________________________________
conv2d_transpose_5 (Conv2DTr (None, 64, 64, 8) 3208
_________________________________________________________________
activation_15 (Activation) (None, 64, 64, 8) 0
_________________________________________________________________
conv2d_transpose_6 (Conv2DTr (None, 128, 128, 4) 804
_________________________________________________________________
activation_16 (Activation) (None, 128, 128, 4) 0
_________________________________________________________________
conv2d_transpose_7 (Conv2DTr (None, 256, 256, 3) 303
=================================================================
Total params: 17,695,435
Trainable params: 17,690,635
Non-trainable params: 4,800
_________________________________________________________________
Epoch 1/1
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
1/1382 [..............................] - ETA: 19:43:47 - loss: 0.6934 - accuracy: 0.1511
2/1382 [..............................] - ETA: 10:04:16 - loss: 0.6933 - accuracy: 0.1545
3/1382 [..............................] - ETA: 7:28:21 - loss: 0.6933 - accuracy: 0.1571
4/1382 [..............................] - ETA: 6:07:30 - loss: 0.6932 - accuracy: 0.1590
5/1382 [..............................] - ETA: 5:21:58 - loss: 0.6931 - accuracy: 0.1614
6/1382 [..............................] - ETA: 4:55:45 - loss: 0.6930 - accuracy: 0.1648
7/1382 [..............................] - ETA: 4:32:58 - loss: 0.6929 - accuracy: 0.1668
8/1382 [..............................] - ETA: 4:15:07 - loss: 0.6929 - accuracy: 0.1692
9/1382 [..............................] - ETA: 4:02:22 - loss: 0.6928 - accuracy: 0.1726
10/1382 [..............................] - ETA: 3:50:11 - loss: 0.6926 - accuracy: 0.1745
11/1382 [..............................] - ETA: 3:39:13 - loss: 0.6925 - accuracy: 0.1769
12/1382 [..............................] - ETA: 3:29:38 - loss: 0.6924 - accuracy: 0.1797
13/1382 [..............................] - ETA: 3:21:11 - loss: 0.6923 - accuracy: 0.1824
14/1382 [..............................] - ETA: 3:13:42 - loss: 0.6922 - accuracy: 0.1845
15/1382 [..............................] - ETA: 3:07:17 - loss: 0.6920 - accuracy: 0.1871
16/1382 [..............................] - ETA: 3:01:59 - loss: 0.6919 - accuracy: 0.1896
17/1382 [..............................] - ETA: 2:57:36 - loss: 0.6918 - accuracy: 0.1916
18/1382 [..............................] - ETA: 2:53:06 - loss: 0.6917 - accuracy: 0.1938
19/1382 [..............................] - ETA: 2:49:37 - loss: 0.6915 - accuracy: 0.1956
20/1382 [..............................] - ETA: 2:45:51 - loss: 0.6915 - accuracy: 0.1979
21/1382 [..............................] - ETA: 2:43:18 - loss: 0.6914 - accuracy: 0.2000
22/1382 [..............................] - ETA: 2:41:02 - loss: 0.6913 - accuracy: 0.2022
23/1382 [..............................] - ETA: 2:39:23 - loss: 0.6912 - accuracy: 0.2039
24/1382 [..............................] - ETA: 2:37:23 - loss: 0.6911 - accuracy: 0.2060
25/1382 [..............................] - ETA: 2:35:58 - loss: 0.6909 - accuracy: 0.2080
26/1382 [..............................] - ETA: 2:34:06 - loss: 0.6909 - accuracy: 0.2098
27/1382 [..............................] - ETA: 2:33:19 - loss: 0.6908 - accuracy: 0.2115
28/1382 [..............................] - ETA: 2:32:24 - loss: 0.6906 - accuracy: 0.2130
29/1382 [..............................] - ETA: 2:31:43 - loss: 0.6904 - accuracy: 0.2143
30/1382 [..............................] - ETA: 2:31:09 - loss: 0.6904 - accuracy: 0.2157
31/1382 [..............................] - ETA: 2:30:34 - loss: 0.6902 - accuracy: 0.2173
32/1382 [..............................] - ETA: 2:29:26 - loss: 0.6901 - accuracy: 0.2185
33/1382 [..............................] - ETA: 2:28:55 - loss: 0.6900 - accuracy: 0.2199
34/1382 [..............................] - ETA: 2:28:05 - loss: 0.6899 - accuracy: 0.2213
35/1382 [..............................] - ETA: 2:27:23 - loss: 0.6898 - accuracy: 0.2227
36/1382 [..............................] - ETA: 2:27:02 - loss: 0.6897 - accuracy: 0.2238
37/1382 [..............................] - ETA: 2:26:56 - loss: 0.6895 - accuracy: 0.2253
38/1382 [..............................] - ETA: 2:26:32 - loss: 0.6893 - accuracy: 0.2266
39/1382 [..............................] - ETA: 2:26:11 - loss: 0.6891 - accuracy: 0.2278
Even after waiting for hours, the training process does not move any further.
Another unusual thing I noticed is that when I set the batch size to 1, the GPU is continuously utilized.
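Since the hang only appears with larger batch sizes, it helps to separate the model from the input pipeline by timing the data generator on its own, with no GPU involved. Hangs like this are often traced to the input pipeline (for example, generator workers blocked on slow I/O) rather than the model. Below is a minimal sketch assuming a Keras ImageDataGenerator pipeline like the one implied by the "Found ... images belonging to 1 classes" messages; the directory path and batch size are placeholders, and the imports should be swapped for tf.keras if that is what the training script uses:

import time
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)
# Placeholder path; point this at the actual dataset directory.
gen = datagen.flow_from_directory(
    "/path/to/dataset",
    target_size=(256, 256),
    batch_size=32,
    class_mode=None,  # autoencoder: images only, no labels
)

# Pull a number of batches and time each one. If these times grow or the loop
# stalls around the same batch count as training does, the bottleneck is in
# data loading, not in the model or the GPU.
for i in range(50):
    t0 = time.time()
    batch = next(gen)
    print("batch %d: %.2fs, shape %s" % (i, time.time() - t0, batch.shape))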
What could be the problem?
This turned out to be an issue with the drive where the dataset was placed. The code was working fine everywhere except on this server. I moved the dataset from one NFS share to another, and now everything works well.
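For anyone who wants to confirm this kind of I/O bottleneck before moving data around, a quick check is to measure raw read throughput from the share outside of TensorFlow entirely. A rough sketch (the mount path is a placeholder for the dataset location on the suspect share):

import os
import time

root = "/mnt/nfs_share/dataset"  # placeholder for the suspect NFS mount
paths = []
for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        paths.append(os.path.join(dirpath, name))
    if len(paths) >= 200:
        break

# Read a sample of files and report throughput; single-digit MB/s on a share
# that must feed 256x256 image batches would explain a stalled fit_generator.
t0 = time.time()
total = 0
for p in paths[:200]:
    with open(p, "rb") as f:
        total += len(f.read())
elapsed = time.time() - t0
print("read %.1f MB in %.1fs (%.1f MB/s)" % (total / 1e6, elapsed, total / 1e6 / elapsed))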