Tags: amazon-web-services, docker, pytorch, amazon-sagemaker, entry-point

What to define as entrypoint when initializing a pytorch estimator with a custom docker image for training on AWS Sagemaker?


So I created a Docker image for training. In the Dockerfile I have an ENTRYPOINT defined such that when docker run is executed, it starts running my Python code. To use this on AWS SageMaker, my understanding is that I need to create a PyTorch estimator in a Jupyter notebook in SageMaker. I tried something like this:

import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()

role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='1.3.1',
                    image_name='xxx.ecr.eu-west-1.amazonaws.com/xxx:latest',
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    hyperparameters={})

estimator.fit({})

In the documentation I found that as image name I can specify the link to my Docker image on AWS ECR. When I try to execute this, it keeps complaining:

[Errno 2] No such file or directory: 'train.py'

It complains immediately, so surely I am doing something completely wrong. I would expect my Docker image to run first, and only then could it find out that the entry point does not exist.

But besides this, why do I need to specify an entry point at all? Should it not be clear that the entry to my training is simply docker run?

For better understanding: the entrypoint Python file in my Docker image looks like this:

import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=16)
    parser.add_argument('--learning_rate', type=float, default=0.0001)

    # Data and output directories
    parser.add_argument('--output_data_dir', type=str, default=os.environ['OUTPUT_DATA_DIR'])
    parser.add_argument('--train_data_path', type=str, default=os.environ['CHANNEL_TRAIN'])
    parser.add_argument('--valid_data_path', type=str, default=os.environ['CHANNEL_VALID'])

    # Start training
    ...

Later I would like to specify the hyperparameters and data channels. But for now I simply do not understand what to put as entry point. In the documentation it says that the entrypoint is required and it should be a local/global path to the entrypoint...


Solution

  • If you really want to use a completely separate Docker image that you built yourself, you should create an Amazon SageMaker algorithm (which is one of the options in the SageMaker menu). There you specify a link to your Docker image on Amazon ECR, as well as the input parameters, data channels, etc. When choosing this option, you should not use the PyTorch estimator but the Algorithm estimator. This way you indeed do not have to specify an entry point, because SageMaker simply runs your Docker image for training and the default entrypoint can be defined in your Dockerfile (see the first sketch below).

    The PyTorch estimator is for the case where you have your own model code but want to run it in an off-the-shelf SageMaker PyTorch Docker image; that is why you have to specify, for example, the PyTorch framework version. In that case the entry point file should by default be placed next to where your Jupyter notebook is stored (just upload the file by clicking the upload button). The PyTorch estimator inherits all options from the Framework estimator, which provides options such as source_dir that control where the entry point and model code are located (see the second sketch below).
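
    For the first option, here is a minimal sketch of what the notebook code could look like, assuming you have already registered your ECR image as a SageMaker algorithm resource. The algorithm ARN, hyperparameter names, channel names and S3 paths are placeholders; they have to match whatever you declared when creating the algorithm.

import sagemaker
from sagemaker.algorithm import AlgorithmEstimator

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder ARN of the algorithm resource created from the custom ECR image.
algo_estimator = AlgorithmEstimator(
    algorithm_arn='arn:aws:sagemaker:eu-west-1:xxx:algorithm/xxx',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    # Hyperparameters must be declared in the algorithm's hyperparameter
    # specification; they are validated against it.
    hyperparameters={'epochs': 5, 'batch_size': 16, 'learning_rate': 0.0001})

# No entry_point here: SageMaker runs the Docker image itself, so the
# ENTRYPOINT from the Dockerfile is what executes. Channel names must match
# the channels declared for the algorithm and are mounted under
# /opt/ml/input/data/<channel> inside the container.
algo_estimator.fit({'train': 's3://xxx/train', 'valid': 's3://xxx/valid'})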
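
    For the second option, here is a minimal sketch of using the prebuilt SageMaker PyTorch container with your own training script. It assumes train.py lives in a hypothetical src/ folder uploaded next to the notebook; image_name is omitted so the off-the-shelf PyTorch image is used, and the S3 paths are placeholders.

import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# entry_point is resolved relative to source_dir; 'src' is a hypothetical
# folder next to this notebook that contains train.py.
estimator = PyTorch(entry_point='train.py',
                    source_dir='src',
                    role=role,
                    framework_version='1.3.1',
                    train_instance_count=1,
                    train_instance_type='ml.p3.2xlarge',
                    # Passed to train.py as --epochs 5 --batch_size 16
                    hyperparameters={'epochs': 5, 'batch_size': 16})

# With the prebuilt container, the channel names show up inside the container
# as SM_CHANNEL_TRAIN / SM_CHANNEL_VALID environment variables.
estimator.fit({'train': 's3://xxx/train', 'valid': 's3://xxx/valid'})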