I am very much novice at neural networks / machine learning. I am trying to learn more by using RotNet, a NN that will classify rotation angles in images. I am trying to train my network using the MNIST dataset, and have changed only one line of the repo (a log directory file path) but other than that have been able to run it successfully.
Here is how I am running it based on the README:
& .../Anaconda3/envs/tflow/python.exe .../RotNet/train/train_mnist.py
and then the output:
Using TensorFlow backend.
Input shape: (28, 28, 1)
60000 train samples
10000 test samples
2020-10-16 12:18:17.031214: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 28, 28, 1) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 26, 26, 64) 640
_________________________________________________________________
conv2d_2 (Conv2D) (None, 24, 24, 64) 36928
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 64) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 12, 12, 64) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 9216) 0
_________________________________________________________________
dense_1 (Dense) (None, 128) 1179776
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_2 (Dense) (None, 360) 46440
=================================================================
Total params: 1,263,784
Trainable params: 1,263,784
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
1/468 [..............................] - ETA: 2:21 - loss: 5.8862 - angle_error: 87.14062020-10-16 12:18:18.337183: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
469/468 [==============================] - 61s 130ms/step - loss: 5.0338 - angle_error: 81.4492 - val_loss: 4.1144 - val_angle_error: 65.9470
Epoch 2/50
469/468 [==============================] - 61s 131ms/step - loss: 4.3072 - angle_error: 64.7485 - val_loss: 3.4630 - val_angle_error: 53.0140
Epoch 3/50
469/468 [==============================] - 63s 134ms/step - loss: 4.0303 - angle_error: 56.3245 - val_loss: 3.2241 - val_angle_error: 47.0283
Epoch 4/50
469/468 [==============================] - 63s 134ms/step - loss: 3.8824 - angle_error: 52.2043 - val_loss: 3.3227 - val_angle_error: 43.2439
Epoch 5/50
469/468 [==============================] - 63s 135ms/step - loss: 3.7982 - angle_error: 49.9996 - val_loss: 3.1930 - val_angle_error: 41.1242
Epoch 6/50
469/468 [==============================] - 73s 155ms/step - loss: 3.7288 - angle_error: 48.4027 - val_loss: 2.9600 - val_angle_error: 39.9322
Epoch 7/50
469/468 [==============================] - 63s 133ms/step - loss: 3.6781 - angle_error: 46.5616 - val_loss: 3.2243 - val_angle_error: 38.6193
Epoch 8/50
469/468 [==============================] - 62s 132ms/step - loss: 3.6439 - angle_error: 45.2133 - val_loss: 2.8629 - val_angle_error: 38.0046
Epoch 9/50
469/468 [==============================] - 62s 132ms/step - loss: 3.6132 - angle_error: 44.7204 - val_loss: 3.0085 - val_angle_error: 37.4514
Epoch 10/50
469/468 [==============================] - 62s 132ms/step - loss: 3.5817 - angle_error: 43.8439 - val_loss: 3.0073 - val_angle_error: 35.8109
The script train_mnist.py is located here and it specifies 50 epochs. I am getting no error, the program simply stops after the 8th or 10th epoch. I am at a loss for how to fix this issue. Any advice would be appreciated!
I took a quick look at the code. In it there is this line:
callbacks=[checkpointer, early_stopping, tensorboard]
The call back early_stopping by default monitors the validation loss. The code used for early stopping is set such that if the validation loss fails to improve for more than 2 consecutive epochs training will halt. That is why it does not train for 50 epochs. If you want it to continue training for the full 50 remove early_stopping from the line of code above. You can see that early_stopping is causing the training to terminate by changing the code in the script from
early_stopping = EarlyStopping(patience=2)
# change code to
early_stopping = EarlyStopping(patience=2, verbose=1)
From the training data this model does not appear to be training very well. I suggest you try transfer learning with MobileNet. Code below shows how to use it,
mobile = tf.keras.applications.mobilenet.MobileNet( include_top=False, input_shape=(img_size, img_size,3), pooling='max', weights='imagenet', dropout=.5)
x=mobile.layers[-1].output # this is the last layer in the mobilenet model the global max pooling layer
x=keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001 )(x)
x=Dense(126, activation='relu')(x)
x=Dropout(rate=.3, seed = 123)(x)
predictions=Dense (len(classes), activation='softmax')(x)
model = Model(inputs=mobile.input, outputs=predictions)
Adapt the above to your situation it should work much better for layer in model.layers: layer.trainable=True model.compile(Adamax(lr=lr), loss='categorical_crossentropy', metrics=['accuracy'])