python-3.x deep-learning lstm amazon-sagemaker hyperparameters

AWS Sagemaker KeyError: 'SM_CHANNEL_TRAINING' when tuning hyperparameters


When I try to run hyperparameter tuning on SageMaker, I get this error:

UnexpectedStatusException: Error for HyperParameterTuning job imageclassif-job-10-21-47-43: Failed. Reason: No training job succeeded after 5 attempts. Please take a look at the training job failures to get more details.

When I look at the logs in CloudWatch, all 5 failed training jobs end with the same error:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/ml/code/train.py", line 117, in <module>
    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
  File "/usr/lib/python3.5/os.py", line 725, in __getitem__
    raise KeyError(key) from None

and

KeyError: 'SM_CHANNEL_TRAINING'

The problem occurs at Step 4 of the project: https://github.com/petrooha/Deploying-LSTM/blob/main/SageMaker%20Project.ipynb

I would highly appreciate any hints on where to look next.


Solution

  • In your train.py file, change the environment variable from

    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])

    to

    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    This should address the issue.

    This happens with PyTorch framework_version 1.3.1, but other versions might also be affected.
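
For context, SageMaker exposes each input channel to the training container as an environment variable named SM_CHANNEL_ followed by the upper-cased channel name, so the variable that exists depends on the channel key the job was started with. The sketch below is one possible way to make train.py tolerate either name instead of raising a KeyError. The helper names (channel_env_var, default_data_dir, build_parser) and the list of channel names checked are assumptions for illustration, not part of the original project; adjust the names to match the channels your job actually passes.

```python
import argparse
import os


def channel_env_var(channel_name):
    """Build the SM_CHANNEL_* variable name for a given channel name."""
    return "SM_CHANNEL_" + channel_name.upper()


def default_data_dir(env=os.environ, channel_names=("training", "train")):
    """Return the directory of the first channel found in env, or None.

    The channel names checked here are assumptions; use the keys that
    your own job passes as input channels.
    """
    for name in channel_names:
        value = env.get(channel_env_var(name))
        if value is not None:
            return value
    return None


def build_parser(env=os.environ):
    """Create the argument parser without indexing os.environ directly."""
    parser = argparse.ArgumentParser()
    default = default_data_dir(env)
    # If no channel variable is set, require --data-dir explicitly so the
    # failure is a clear argparse message rather than a KeyError.
    parser.add_argument(
        "--data-dir", type=str, default=default, required=default is None
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("Reading training data from:", args.data_dir)
```

With this approach the same script should work whether the job is launched with a channel called training or train, and it can still be run locally by passing --data-dir by hand.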