With this I successfully created a training job on SageMaker using the TensorFlow Object Detection API in a Docker container. Now I'd like to monitor the training job with TensorBoard, but I cannot find anything explaining how to do it. I don't use a SageMaker notebook.
I think I can do it by saving the logs to an S3 bucket and pointing a local TensorBoard instance at it, but I don't know how to tell the TensorFlow Object Detection API where to save the logs (is there a command-line argument for this?).
Something like this, but the script generate_tensorboard_command.py fails because my training job doesn't have the sagemaker_submit_directory parameter.
The fact is that when I start the training job, nothing is created in my S3 bucket until the job finishes and uploads everything. There should be a way to tell TensorFlow where to save the logs (on S3) during training, hopefully without modifying the API source code.
Edit
I finally made it work with the accepted solution (TensorFlow natively supports reading from and writing to S3); there were, however, a few additional steps involved.
The only issue is that TensorFlow continuously polls the filesystem (e.g. looking for an updated model in serving mode), and this causes useless requests to S3 that you will have to pay for (together with a bunch of errors in the console). I opened a new question here for this. At least it works.
Edit 2
I was wrong: TF just writes the logs, it is not polling, so this is expected behavior and the extra costs are minimal.
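For anyone trying the same route, here is a minimal sketch (placeholders only, not my exact setup) of pointing a local TensorBoard at the S3 prefix the training job writes to; the equivalent one-liner is simply tensorboard --logdir s3://your-bucket/path/here. TensorFlow has to be installed next to TensorBoard so it can read s3:// paths, and credentials/region come from the usual AWS environment variables or ~/.aws configuration:

# Minimal sketch: launch a local TensorBoard that reads event files straight from S3.
# The bucket/prefix below is a placeholder; credentials and region are taken from
# the standard AWS environment variables or ~/.aws configuration.
import time
from tensorboard import program

logdir = "s3://your-bucket/path/here"  # same prefix the training job writes to

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", logdir])
url = tb.launch()  # starts the TensorBoard server in a background thread
print("TensorBoard is listening on", url)

# Keep the process alive so the background server keeps serving.
while True:
    time.sleep(60)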
Looking through the example you posted, it looks as though the model_dir passed to the TensorFlow Object Detection package is configured to /opt/ml/model:
# These are the paths to where SageMaker mounts interesting things in your container.
prefix = '/opt/ml/'
input_path = os.path.join(prefix, 'input/data')
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
During the training process, TensorBoard logs will be written to /opt/ml/model, and then uploaded to S3 as a final model artifact AFTER training: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-envvariables.html.
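For context, here is a rough sketch (illustrative only, not the container's actual code) of how that model_dir typically flows into TensorFlow's estimator-based training, which is why the checkpoints and event files sit under /opt/ml/model until SageMaker uploads them at the end of the job:

# Rough sketch: the model_dir handed to tf.estimator controls where checkpoints and
# TensorBoard event files are written. With /opt/ml/model, everything stays on the
# container's local filesystem until SageMaker uploads it after training.
import os
import tensorflow as tf

prefix = '/opt/ml/'
model_path = os.path.join(prefix, 'model')

run_config = tf.estimator.RunConfig(
    model_dir=model_path,         # event files and checkpoints land here
    save_summary_steps=100,       # illustrative values
    save_checkpoints_steps=1000,
)
# estimator = tf.estimator.Estimator(model_fn=..., config=run_config)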
You might be able to side-step the SageMaker artifact upload step and point the model_dir of the TensorFlow Object Detection API directly at an S3 location during training:
model_path = "s3://your-bucket/path/here"
This means that the TensorFlow library within the SageMaker job writes directly to S3 instead of to the filesystem inside its container. Assuming the underlying TensorFlow Object Detection code can write directly to S3 (something you'll have to verify), you should be able to see the TensorBoard logs and checkpoints there in real time.
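One quick way to check that assumption from inside the container (or any environment with the same TensorFlow build) is a small write test with TensorFlow's file API; the bucket/prefix is a placeholder:

# Sketch: verify that this TensorFlow build can write to S3 before pointing
# model_dir there. On older TF 1.x releases use tf.gfile instead of tf.io.gfile.
import tensorflow as tf

test_path = "s3://your-bucket/path/here/_write_test.txt"  # placeholder bucket/prefix

with tf.io.gfile.GFile(test_path, "w") as f:
    f.write("s3 write test from the training container")

print(tf.io.gfile.exists(test_path))  # True means the S3 filesystem support works

If that succeeds, pointing model_path at the same s3:// prefix should let the Object Detection API write its checkpoints and event files there while the job is running, and a local TensorBoard can read from that same location.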