I am training an autoencoder network with TensorFlow GPU 1.13.1. Initially I used batch sizes of 32, 64, and 128, but the GPU does not seem to be used at all, even though nvidia-smi reports the following under "Memory-Usage":
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 34C P0 53W / 300W | 31316MiB / 32480MiB | 0% Default |
Also, training stops at step 39 every time.
Model: "model_1"
Layer (type) Output Shape Param #
input_3 (InputLayer) (None, 256, 256, 3) 0
conv2d_6 (Conv2D) (None, 64, 64, 96) 34944
batch_normalization_6 (Batch (None, 64, 64, 96) 384
activation_6 (Activation) (None, 64, 64, 96) 0
max_pooling2d_4 (MaxPooling2 (None, 31, 31, 96) 0
conv2d_7 (Conv2D) (None, 31, 31, 256) 614656
batch_normalization_7 (Batch (None, 31, 31, 256) 1024
activation_7 (Activation) (None, 31, 31, 256) 0
max_pooling2d_5 (MaxPooling2 (None, 15, 15, 256) 0
conv2d_8 (Conv2D) (None, 15, 15, 384) 885120
batch_normalization_8 (Batch (None, 15, 15, 384) 1536
activation_8 (Activation) (None, 15, 15, 384) 0
conv2d_9 (Conv2D) (None, 15, 15, 384) 1327488
batch_normalization_9 (Batch (None, 15, 15, 384) 1536
activation_9 (Activation) (None, 15, 15, 384) 0
conv2d_10 (Conv2D) (None, 15, 15, 256) 884992
batch_normalization_10 (Batc (None, 15, 15, 256) 1024
activation_10 (Activation) (None, 15, 15, 256) 0
max_pooling2d_6 (MaxPooling2 (None, 7, 7, 256) 0
conv2d_11 (Conv2D) (None, 1, 1, 1024) 12846080
batch_normalization_11 (Batc (None, 1, 1, 1024) 4096
encoded (Activation) (None, 1, 1, 1024) 0
reshape_1 (Reshape) (None, 2, 2, 256) 0
conv2d_transpose_1 (Conv2DTr (None, 4, 4, 128) 819328
activation_11 (Activation) (None, 4, 4, 128) 0
conv2d_transpose_2 (Conv2DTr (None, 8, 8, 64) 204864
activation_12 (Activation) (None, 8, 8, 64) 0
conv2d_transpose_3 (Conv2DTr (None, 16, 16, 32) 51232
activation_13 (Activation) (None, 16, 16, 32) 0
conv2d_transpose_4 (Conv2DTr (None, 32, 32, 16) 12816
activation_14 (Activation) (None, 32, 32, 16) 0
conv2d_transpose_5 (Conv2DTr (None, 64, 64, 8) 3208
activation_15 (Activation) (None, 64, 64, 8) 0
conv2d_transpose_6 (Conv2DTr (None, 128, 128, 4) 804
activation_16 (Activation) (None, 128, 128, 4) 0
conv2d_transpose_7 (Conv2DTr (None, 256, 256, 3) 303
Total params: 17,695,435
Trainable params: 17,690,635
Non-trainable params: 4,800
Epoch 1/1
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 11058 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
Found 44234 images belonging to 1 classes.
1/1382 [..............................] - ETA: 19:43:47 - loss: 0.6934 - accuracy: 0.1511
2/1382 [..............................] - ETA: 10:04:16 - loss: 0.6933 - accuracy: 0.1545
3/1382 [..............................] - ETA: 7:28:21 - loss: 0.6933 - accuracy: 0.1571
4/1382 [..............................] - ETA: 6:07:30 - loss: 0.6932 - accuracy: 0.1590
5/1382 [..............................] - ETA: 5:21:58 - loss: 0.6931 - accuracy: 0.1614
6/1382 [..............................] - ETA: 4:55:45 - loss: 0.6930 - accuracy: 0.1648
7/1382 [..............................] - ETA: 4:32:58 - loss: 0.6929 - accuracy: 0.1668
8/1382 [..............................] - ETA: 4:15:07 - loss: 0.6929 - accuracy: 0.1692
9/1382 [..............................] - ETA: 4:02:22 - loss: 0.6928 - accuracy: 0.1726
10/1382 [..............................] - ETA: 3:50:11 - loss: 0.6926 - accuracy: 0.1745
11/1382 [..............................] - ETA: 3:39:13 - loss: 0.6925 - accuracy: 0.1769
12/1382 [..............................] - ETA: 3:29:38 - loss: 0.6924 - accuracy: 0.1797
13/1382 [..............................] - ETA: 3:21:11 - loss: 0.6923 - accuracy: 0.1824
14/1382 [..............................] - ETA: 3:13:42 - loss: 0.6922 - accuracy: 0.1845
15/1382 [..............................] - ETA: 3:07:17 - loss: 0.6920 - accuracy: 0.1871
16/1382 [..............................] - ETA: 3:01:59 - loss: 0.6919 - accuracy: 0.1896
17/1382 [..............................] - ETA: 2:57:36 - loss: 0.6918 - accuracy: 0.1916
18/1382 [..............................] - ETA: 2:53:06 - loss: 0.6917 - accuracy: 0.1938
19/1382 [..............................] - ETA: 2:49:37 - loss: 0.6915 - accuracy: 0.1956
20/1382 [..............................] - ETA: 2:45:51 - loss: 0.6915 - accuracy: 0.1979
21/1382 [..............................] - ETA: 2:43:18 - loss: 0.6914 - accuracy: 0.2000
22/1382 [..............................] - ETA: 2:41:02 - loss: 0.6913 - accuracy: 0.2022
23/1382 [..............................] - ETA: 2:39:23 - loss: 0.6912 - accuracy: 0.2039
24/1382 [..............................] - ETA: 2:37:23 - loss: 0.6911 - accuracy: 0.2060
25/1382 [..............................] - ETA: 2:35:58 - loss: 0.6909 - accuracy: 0.2080
26/1382 [..............................] - ETA: 2:34:06 - loss: 0.6909 - accuracy: 0.2098
27/1382 [..............................] - ETA: 2:33:19 - loss: 0.6908 - accuracy: 0.2115
28/1382 [..............................] - ETA: 2:32:24 - loss: 0.6906 - accuracy: 0.2130
29/1382 [..............................] - ETA: 2:31:43 - loss: 0.6904 - accuracy: 0.2143
30/1382 [..............................] - ETA: 2:31:09 - loss: 0.6904 - accuracy: 0.2157
31/1382 [..............................] - ETA: 2:30:34 - loss: 0.6902 - accuracy: 0.2173
32/1382 [..............................] - ETA: 2:29:26 - loss: 0.6901 - accuracy: 0.2185
33/1382 [..............................] - ETA: 2:28:55 - loss: 0.6900 - accuracy: 0.2199
34/1382 [..............................] - ETA: 2:28:05 - loss: 0.6899 - accuracy: 0.2213
35/1382 [..............................] - ETA: 2:27:23 - loss: 0.6898 - accuracy: 0.2227
36/1382 [..............................] - ETA: 2:27:02 - loss: 0.6897 - accuracy: 0.2238
37/1382 [..............................] - ETA: 2:26:56 - loss: 0.6895 - accuracy: 0.2253
38/1382 [..............................] - ETA: 2:26:32 - loss: 0.6893 - accuracy: 0.2266
39/1382 [..............................] - ETA: 2:26:11 - loss: 0.6891 - accuracy: 0.2278
Even after waiting for hours, training does not progress any further.
Another unusual thing I noticed is that with the batch size set to 1, the GPU is utilized continuously.
What could be the problem?
This might be an issue with the drive where the dataset is stored. In my case, the same code worked fine everywhere except on this server. After I moved the dataset from one NFS share to another, everything worked well.
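
To check whether storage is the bottleneck before moving the data, one option is to time raw file reads from the dataset directory, independent of TensorFlow. The following is only a minimal sketch using the Python standard library; the function name `measure_read_throughput` and its parameters are hypothetical and not part of the original setup.

```python
import os
import time


def measure_read_throughput(root, max_files=200, chunk_size=1 << 20):
    """Read up to max_files files under root.

    Returns a tuple (files_read, bytes_read, elapsed_seconds).
    """
    files_read = 0
    bytes_read = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if files_read >= max_files:
                break
            # Read the file in chunks so large files do not need to fit in memory.
            with open(os.path.join(dirpath, name), "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    bytes_read += len(chunk)
            files_read += 1
        if files_read >= max_files:
            break
    elapsed = time.perf_counter() - start
    return files_read, bytes_read, elapsed
```

Calling it on the dataset directory of each share and comparing files per second (`files_read / elapsed`) gives a rough idea of whether one share is much slower than the other. Files read recently may be served from the operating system's cache, so the first run on a given set of files is the most representative.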