
How to plot history of training metrics in Sagemaker .py training


I am running a notebook in Sagemaker and I use a .py file for training:

tf_estimator = TensorFlow(entry_point='train_cnn.py', 
                          role=role,
                          train_instance_count=1, 
                          train_instance_type='local',  #We use the local instance
                          framework_version='1.12', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1} #One epoch just to check everything is ok
                         )

#We fit the model with the data
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})

In the train_cnn.py file I use a standard CNN. The last part of the file is supposed to plot the training history like this:

model.compile(loss=tensorflow.keras.losses.binary_crossentropy,
              optimizer=Adam(lr=lr),
              metrics=['accuracy'])

snn = model.fit(train_images, train_labels, batch_size=batch_size,
                validation_data=(test_images, test_labels),
                epochs=epochs,
                verbose=2)

score = model.evaluate(test_images, test_labels, verbose=0)
print('Validation loss    :', score[0])
print('Validation accuracy:', score[1])
   
plt.figure(0)
plt.plot(snn.history['acc'], 'r')
plt.plot(snn.history['val_acc'], 'g')
plt.xticks(np.arange(0, 11, 2.0))  
plt.rcParams['figure.figsize'] = (8, 6)  
plt.xlabel("Num of Epochs")  
plt.ylabel("Accuracy")  
plt.title("Training Accuracy")  
plt.legend(['train', 'validation'])
plt.figure(1)  
plt.plot(snn.history['loss'],'r')  
plt.plot(snn.history['val_loss'],'g')  
plt.xticks(np.arange(0, 11, 2.0))  
plt.rcParams['figure.figsize'] = (8, 6)  
plt.xlabel("Num of Epochs")  
plt.ylabel("Loss")  
plt.title("Training Loss vs Validation Loss")  
plt.legend(['train','validation'])
plt.show()  

However, nothing is displayed, even though the training reports success -- maybe because it is performed in another instance. Here is the displayed information:

Epoch 1/1
algo-1-tn2vd_1  |  - 2s - loss: 0.8858 - acc: 0.4615 - val_loss: 3.0894 - val_acc: 0.5000
algo-1-tn2vd_1  | Validation loss    : 3.0894343852996826
algo-1-tn2vd_1  | Validation accuracy: 0.5
algo-1-tn2vd_1  | WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/simple_save.py:85: calling SavedModelBuilder.add_meta_graph_and_variables (from tensorflow.python.saved_model.builder_impl) with legacy_init_op is deprecated and will be removed in a future version.
algo-1-tn2vd_1  | Instructions for updating:
algo-1-tn2vd_1  | Pass your op to the equivalent parameter main_op instead.
algo-1-tn2vd_1  | 2020-07-12 00:42:23,538 sagemaker-containers INFO     Reporting training SUCCESS
tmpuzys_qpc_algo-1-tn2vd_1 exited with code 0
Aborting on container exit...
===== Job Complete =====

Is there a way to plot the history of the training job? For instance, like the one in the next figure:


Solution

  • A SageMaker training job in "local" mode actually executes inside a Docker container that is isolated from the Python kernel running your notebook. The plt.show() in the train_cnn.py script therefore never reaches the notebook UI the way it would if you executed that command directly in a notebook cell.

    Instead of using plt.show(), consider using plt.savefig() to output the plot to an image:

    plt.savefig("training_results.png")  
    

    Upon termination of the training container, SageMaker will zip up all the output artifacts (including the plot) and ship them to the S3 output location configured for your training job. Alternatively, you could upload the plot straight to S3 from within your training script -- see python - uploading a plot from memory to s3 using matplotlib and boto for an example of this.
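    As a concrete sketch of this approach: SageMaker script-mode containers expose the output directory through the SM_OUTPUT_DATA_DIR environment variable, and anything written there is uploaded to S3 when the job ends. The history dict below is a small stand-in for snn.history from the training script, and the fallback path "." is only so the snippet also runs outside a container:

    ```python
    import os
    import matplotlib
    matplotlib.use("Agg")  # headless backend: the training container has no display
    import matplotlib.pyplot as plt

    # SageMaker sets this variable inside the container; the fallback is for local runs.
    output_dir = os.environ.get("SM_OUTPUT_DATA_DIR", ".")

    # Stand-in for snn.history produced by model.fit() in train_cnn.py.
    history = {"acc": [0.46, 0.61, 0.72], "val_acc": [0.50, 0.58, 0.66]}

    plt.figure()
    plt.plot(history["acc"], "r")
    plt.plot(history["val_acc"], "g")
    plt.xlabel("Num of Epochs")
    plt.ylabel("Accuracy")
    plt.title("Training Accuracy")
    plt.legend(["train", "validation"])

    # Save instead of show; SageMaker ships the file to S3 with the other artifacts.
    plot_path = os.path.join(output_dir, "training_results.png")
    plt.savefig(plot_path)
    ```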

    As a side note: have you considered using TensorBoard? It can offer a better experience for browsing results of a training script, and SageMaker should have a first class integration to make it easy to enable. Take a look at the run_tensorboard_locally argument.
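    As a rough configuration sketch (untested, and support depends on your SageMaker Python SDK version and training mode): in SDK v1, run_tensorboard_locally is a flag on fit(), so enabling it for the estimator from the question would look roughly like this:

    ```python
    # Hypothetical sketch: run_tensorboard_locally asks the SDK to launch a
    # local TensorBoard that reads the job's summary output. Verify against
    # your SDK version before relying on it.
    tf_estimator.fit(
        {'training': training_input_path, 'validation': validation_input_path},
        run_tensorboard_locally=True,
    )
    ```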