I recently switched to a different AWS account and ran into a strange error in SageMaker.
I'm just trying out the XGBoost algorithm on a toy dataset, like this:
import boto3
import sagemaker
from sagemaker import image_uris
# session (a sagemaker.Session), role and location_data are defined earlier
xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")
clf = sagemaker.estimator.Estimator(
    xgb_image_uri, role,
    instance_count=1,
    instance_type="ml.c4.2xlarge",
    output_path="s3://{}/output".format(session.default_bucket()),
    sagemaker_session=session,
)
clf.fit(location_data)
The training job starts, but for some reason it stops at the data download step and displays the following message:
2021-10-21 17:33:27 Downloading - Downloading input data
2021-10-21 17:33:27 Stopping - Stopping the training job
2021-10-21 17:33:27 Stopped - Training job stopped
ProfilerReport-1634837444: Stopping
..
Job ended with status 'Stopped' rather than 'Completed'. This could mean the job timed out or stopped early for some other reason: Consider checking whether it completed as you expect.
Also, when I go back to the training jobs section and try to check the logs in CloudWatch, nothing is displayed. Is this a common issue, and has anyone else run into it? Are there any workarounds?
The problem was most likely caused by the SageMaker templates that were run before the instance was created.
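For anyone hitting the same thing with empty CloudWatch logs: the training job description itself may still say why the job stopped. Below is a minimal sketch, assuming the standard boto3 SageMaker client and its describe_training_job call; the helper names fetch_job_description and summarize_job are hypothetical and just for illustration.

```python
from typing import Any, Dict, List


def fetch_job_description(job_name: str) -> Dict[str, Any]:
    """Fetch the training job description (needs AWS credentials)."""
    import boto3  # imported lazily so summarize_job also works offline

    client = boto3.client("sagemaker")
    return client.describe_training_job(TrainingJobName=job_name)


def summarize_job(description: Dict[str, Any]) -> List[str]:
    """Turn a DescribeTrainingJob response into readable lines."""
    lines = ["Status: {}".format(description.get("TrainingJobStatus", "unknown"))]
    # FailureReason is only present when the job actually failed.
    if "FailureReason" in description:
        lines.append("Failure reason: {}".format(description["FailureReason"]))
    # Each transition carries a secondary status and an optional message.
    for step in description.get("SecondaryStatusTransitions", []):
        lines.append("{}: {}".format(step.get("Status"), step.get("StatusMessage", "")))
    return lines
```

Calling summarize_job(fetch_job_description(name)) with the name of the stopped job and printing the result line by line should show the status transitions and, if the job failed rather than being stopped, the failure reason.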