boto3 · amazon-emr · amazon-sagemaker

Is there a way to use a SageMaker lifecycle configuration to run an EMR cluster on notebook start?


I would like to start an EMR cluster every time a SageMaker notebook is started. However, I have found out that lifecycle configuration scripts cannot run for longer than 5 minutes. Sadly, my EMR cluster takes more than 5 minutes to come up. This is a problem, as I need to wait for the cluster to be up in order to retrieve the master node's IP address (that IP address is then used to configure the connection between the SageMaker notebook and the cluster).

Below is an extract of the code that runs in the lifecycle configuration script.

Has anyone out there faced a similar problem and found a solution?

```python
import json
import boto3

client = boto3.client('emr')

job_flow_id = client.run_job_flow(**CLUSTER_CONFIG)['JobFlowId']

...
...

# Retrieve the private IP of the master node for later use
master_instance = client.list_instances(
    ClusterId=job_flow_id, InstanceGroupTypes=['MASTER']
)['Instances'][0]
master_private_ip = master_instance['PrivateIpAddress']

# Send SageMaker the config file that tells it how to communicate with Spark
s3 = boto3.client('s3')
file_object = s3.get_object(Bucket='dataengine', Key='emr/example_config.json')
data = json.loads(file_object['Body'].read().decode('utf-8'))
data['kernel_python_credentials']['url'] = 'http://{}:8998'.format(master_private_ip)
data['kernel_scala_credentials']['url'] = 'http://{}:8998'.format(master_private_ip)
data['kernel_r_credentials']['url'] = 'http://{}:8998'.format(master_private_ip)

with open('./sparkmagic/config.json', 'w') as outfile:
    json.dump(data, outfile)
```
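
The wait the question describes (the cluster must be up before the master IP can be read) could be made explicit with boto3's built-in `cluster_running` EMR waiter. The helper below is a sketch with a hypothetical name; note that the waiter by itself does not lift the 5-minute limit, it only encapsulates the blocking wait:

```python
def wait_for_master_ip(client, job_flow_id):
    """Block until the EMR cluster is up, then return the master node's
    private IP. `client` is a boto3 EMR client."""
    # boto3 ships a 'cluster_running' waiter that polls DescribeCluster
    # until the cluster reaches the RUNNING or WAITING state.
    client.get_waiter('cluster_running').wait(ClusterId=job_flow_id)
    master_instance = client.list_instances(
        ClusterId=job_flow_id, InstanceGroupTypes=['MASTER']
    )['Instances'][0]
    return master_instance['PrivateIpAddress']
```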

Solution

  • You can also consider using nohup in the lifecycle config to execute your script in the background, so you don't get blocked by the 5-minute limit.

    Let us know if there's anything else you need assistance with.

    Thanks,

    Han
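
The nohup suggestion above could look like the following on-start sketch. The path and script name are assumptions; the boto3 code from the question would live in the backgrounded Python script:

```shell
#!/bin/bash
# SageMaker lifecycle "on start" script (sketch).
# SETUP_SCRIPT is a hypothetical path to the boto3 code from the question.
SETUP_SCRIPT=/home/ec2-user/SageMaker/start_emr.py

# nohup + & detach the long-running work from this script, so the
# lifecycle configuration itself returns well under the 5-minute limit.
nohup python "$SETUP_SCRIPT" > /tmp/start_emr.log 2>&1 &

echo "EMR setup launched in background"
```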