I have a model that was trained on a Machine Learning Compute target in Azure Machine Learning Service. The registered model already lives in my workspace, and I would like to deploy it to a pre-existing AKS instance that I previously provisioned in my workspace. I am able to successfully configure and register the container image:
# retrieve cloud representations of the models
rf = Model(workspace=ws, name='pumps_rf')
le = Model(workspace=ws, name='pumps_le')
ohc = Model(workspace=ws, name='pumps_ohc')
print(rf); print(le); print(ohc)
<azureml.core.model.Model object at 0x7f66ab3b1f98>
<azureml.core.model.Model object at 0x7f66ab7e49b0>
<azureml.core.model.Model object at 0x7f66ab85e710>
package_list = [
    'category-encoders==1.3.0',
    'numpy==1.15.0',
    'pandas==0.24.1',
    'scikit-learn==0.20.2']
# Conda environment configuration
myenv = CondaDependencies.create(pip_packages=package_list)
conda_yml = 'file:'+os.getcwd()+'/myenv.yml'
with open(conda_yml,"w") as f:
    f.write(myenv.serialize_to_string())
Configuring and registering the image works:
# Image configuration
image_config = ContainerImage.image_configuration(execution_script='score.py',
                                                  runtime='python',
                                                  conda_file='myenv.yml',
                                                  description='Pumps Random Forest model')
# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name = Config.IMAGE_NAME,
                              models = [rf, le, ohc],
                              image_config = image_config,
                              workspace = ws)
Creating image
Running....................
Succeeded
Image creation operation finished for image pumpsrfimage:2, operation "Succeeded"
Attaching to an existing cluster also works:
# Attach the cluster to your workgroup
attach_config = AksCompute.attach_configuration(resource_group = Config.RESOURCE_GROUP,
                                                cluster_name = Config.DEPLOY_COMPUTE)
aks_target = ComputeTarget.attach(workspace=ws,
                                  name=Config.DEPLOY_COMPUTE,
                                  attach_configuration=attach_config)
# Wait for the operation to complete
aks_target.wait_for_completion(True)
Succeeded
Provisioning operation finished, operation "Succeeded"
However, when I try to deploy the image to the existing cluster, it fails with a WebserviceException.
# Set configuration and service name
aks_config = AksWebservice.deploy_configuration()
# Deploy from image
service = Webservice.deploy_from_image(workspace = ws,
                                       name = 'pumps-aks-service-1',
                                       image = image,
                                       deployment_config = aks_config,
                                       deployment_target = aks_target)
# Wait for the deployment to complete
service.wait_for_deployment(show_output = True)
print(service.state)
WebserviceException: Unable to create service with image pumpsrfimage:1 in non "Succeeded" creation state.
---------------------------------------------------------------------------
WebserviceException Traceback (most recent call last)
<command-201219424688503> in <module>()
7 image = image,
8 deployment_config = aks_config,
----> 9 deployment_target = aks_target)
10 # Wait for the deployment to complete
11 service.wait_for_deployment(show_output = True)
/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/webservice.py in deploy_from_image(workspace, name, image, deployment_config, deployment_target)
284 return child._deploy(workspace, name, image, deployment_config, deployment_target)
285
--> 286 return deployment_config._webservice_type._deploy(workspace, name, image, deployment_config, deployment_target)
287
288 @staticmethod
/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/aks.py in _deploy(workspace, name, image, deployment_config, deployment_target)
Any ideas on how to solve this issue? I am writing the code in a Databricks notebook. Also, I am able to create and deploy the cluster through the Azure Portal without any problem, so this appears to be an issue with my code, the Python SDK, or the way Databricks works with AMLS.
UPDATE: I was able to deploy my image to AKS using the Azure Portal, and the webservice works as expected. This means the issue lies somewhere between Databricks, the azureml Python SDK, and Machine Learning Service.
UPDATE 2: I'm working with Microsoft to fix this issue. Will report back once we have a solution.
In my initial code, when creating the image, I was not using:
image.wait_for_creation(show_output=True)
As a consequence, I was calling CreateImage and then DeployImage right away, before the image had finished being created, which errored out. Can't believe it was that simple.
UPDATED IMAGE CREATION SNIPPET:
# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name = Config.IMAGE_NAME,
                              models = [rf, le, ohc],
                              image_config = image_config,
                              workspace = ws)
image.wait_for_creation(show_output=True)
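To make the same mistake harder to repeat, the wait can be wrapped in a small guard that also checks the final build state before anything is deployed. This is only a sketch: ensure_image_ready is a hypothetical helper name (not part of the SDK), and it assumes the image object exposes wait_for_creation() and a creation_state attribute, as the azureml Image class does.

```python
def ensure_image_ready(image):
    """Block until the image build finishes, then verify that it succeeded.

    `image` is assumed to be an azureml Image-like object exposing
    wait_for_creation() and a creation_state attribute.
    Hypothetical helper, not part of the azureml SDK.
    """
    # Image creation is asynchronous, so wait for the build to finish first.
    image.wait_for_creation(show_output=True)

    # Fail early with a clear message instead of letting the deployment
    # call raise a WebserviceException later on.
    if image.creation_state != 'Succeeded':
        raise RuntimeError(
            'Image build ended in state "{}"; not deploying.'.format(
                image.creation_state))
    return image
```

With this in place, the deployment call could take image = ensure_image_ready(image) instead of the raw image object, so a build that is unfinished or failed stops the notebook before Webservice.deploy_from_image is reached.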