
Azure-ML Deployment does NOT see AzureML Environment (wrong version number)


I've followed the documentation pretty well as outlined here.

I've set up my Azure Machine Learning environment as follows:

from azureml.core import Workspace

# Connect to the workspace
ws = Workspace.from_config()

from azureml.core import Environment
from azureml.core import ContainerRegistry

myenv = Environment(name = "myenv")

myenv.inferencing_stack_version = "latest"  # This will install the inference specific apt packages.

# Docker
myenv.docker.enabled = True
myenv.docker.base_image_registry.address = "myazureregistry.azurecr.io"
myenv.docker.base_image_registry.username = "myusername"
myenv.docker.base_image_registry.password = "mypassword"
myenv.docker.base_image = "4fb3..." 
myenv.docker.arguments = None

# Environment variable (I need Python to look for modules under /root)
myenv.environment_variables = {"PYTHONPATH":"/root"}

# python
myenv.python.user_managed_dependencies = True
myenv.python.interpreter_path = "/opt/miniconda/envs/myenv/bin/python" 

from azureml.core.conda_dependencies import CondaDependencies
conda_dep = CondaDependencies()
conda_dep.add_pip_package("azureml-defaults")
myenv.python.conda_dependencies=conda_dep

myenv.register(workspace=ws) # works!

I have a score.py file configured for inference (not relevant to the problem I'm having)...

I then set up the inference configuration:

from azureml.core.model import InferenceConfig
inference_config = InferenceConfig(entry_script="score.py", environment=myenv)

I set up my compute cluster:

from azureml.core.compute import ComputeTarget, AksCompute
from azureml.exceptions import ComputeTargetException

# Choose a name for your cluster
aks_name = "theclustername" 

# Check to see if the cluster already exists
try:
    aks_target = ComputeTarget(workspace=ws, name=aks_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    prov_config = AksCompute.provisioning_configuration(vm_size="Standard_NC6_Promo")

    aks_target = ComputeTarget.create(workspace=ws, name=aks_name, provisioning_configuration=prov_config)

    aks_target.wait_for_completion(show_output=True)

from azureml.core.webservice import AksWebservice

# Example
gpu_aks_config = AksWebservice.deploy_configuration(autoscale_enabled=False,
                                                    num_replicas=3,
                                                    cpu_cores=4,
                                                    memory_gb=10)

Everything up to this point succeeds. I then try to deploy the model for inference:

from azureml.core.model import Model

model = Model(ws, name="thenameofmymodel")

# Name of the web service that is deployed
aks_service_name = 'tryingtodeply'

# Deploy the model
aks_service = Model.deploy(ws,
                           aks_service_name,
                           models=[model],
                           inference_config=inference_config,
                           deployment_config=gpu_aks_config,
                           deployment_target=aks_target,
                           overwrite=True)

aks_service.wait_for_deployment(show_output=True)
print(aks_service.state)

And it fails saying that it can't find the environment. More specifically, my environment version is version 11, but it keeps trying to find an environment with a version number that is 1 higher (i.e., version 12) than the current environment:

Failed
ERROR - Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: 0f03a025-3407-4dc1-9922-a53cc27267d4
More information can be found here: 
Error:
{
  "code": "BadRequest",
  "statusCode": 400,
  "message": "The request is invalid",
  "details": [
    {
      "code": "EnvironmentDetailsFetchFailedUserError",
      "message": "Failed to fetch details for Environment with Name: myenv Version: 12."
    }
  ]
}

I have tried to manually edit the environment JSON to match the version that azureml is trying to fetch, but nothing works. Can anyone see anything wrong with this code?

Update

Changing the name of the environment (e.g., my_inference_env) and passing it to InferenceConfig seems to be on the right track. However, the error now changes to the following:

Running..........
Failed
ERROR - Service deployment polling reached non-successful terminal state, current service state: Failed
Operation ID: f0dfc13b-6fb6-494b-91a7-de42b9384692
More information can be found here: https://some_long_http_address_that_leads_to_nothing
Error:
{
  "code": "DeploymentFailed",
  "statusCode": 404,
  "message": "Deployment not found"
}

Solution

The answer from Anders below is indeed correct regarding the use of Azure ML environments. However, the last error I was getting occurred because I was setting the container image using the digest value (a SHA) and NOT the image name and tag (e.g., imagename:tag). Note this line of code in the first block:

myenv.docker.base_image = "4fb3..." 

I reference the digest value, but it should be changed to

myenv.docker.base_image = "imagename:tag"

Once I made that change, the deployment succeeded! :)
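
A quick check before deploying could catch this mistake. The following is only a sketch: looks_like_name_and_tag is a hypothetical helper, not part of the Azure ML SDK, and it relies on a simple heuristic. It assumes that a bare hex string (optionally prefixed with sha256:) is a digest, and that anything of the form name:tag is what the deployment expects, since that is the only form I confirmed to work.

```python
import re

# Hypothetical helper for illustration; it is NOT part of the Azure ML SDK.
# Heuristic: a string made only of hex characters (optionally prefixed with
# "sha256:" and optionally truncated with "...") is treated as a digest.
_DIGEST_LIKE = re.compile(r"^(sha256:)?[0-9a-fA-F]+(\.\.\.)?$")


def looks_like_name_and_tag(base_image: str) -> bool:
    """Return True if base_image has the form 'name:tag' rather than a digest."""
    ref = base_image.strip()
    if not ref or _DIGEST_LIKE.match(ref):
        return False
    if "@" in ref:
        # 'name@sha256:...' pins a digest, which is not the name:tag form.
        return False
    # Split on the last colon. If the part after it contains a slash, the
    # colon belonged to a registry port (host:5000/name) and no tag is present.
    name, sep, tag = ref.rpartition(":")
    return bool(name) and bool(sep) and bool(tag) and "/" not in tag
```

It could be called on the value just before assigning myenv.docker.base_image, raising an error or printing a warning if it returns False.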


Solution

  • One concept that took me a while to grasp is the distinction between registering and using an Azure ML Environment. If you have already registered your environment, myenv, and none of its details have changed, there is no need to re-register it with myenv.register(). You can simply retrieve the already-registered environment with Environment.get(), like so:

    myenv = Environment.get(ws, name='myenv', version=11)
    

    My recommendation would be to give your environment a new name, such as "model_scoring_env". Register it once, then pass it to the InferenceConfig.
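
Putting the two steps together, retrieving the registered environment and handing it to InferenceConfig might look like the sketch below. The function name build_inference_config and its parameters are made up for illustration, and the version you pass should be whatever your workspace actually shows for the environment. The SDK imports sit inside the function only so that the snippet can be loaded without azureml-core installed.

```python
def build_inference_config(ws, env_name, env_version, entry_script="score.py"):
    """Fetch an already-registered environment and wrap it in an InferenceConfig.

    Hypothetical convenience wrapper: it does not call register(), so it does
    not create a new environment version as a side effect.
    """
    # Imported lazily so this module loads even where the SDK is not installed.
    from azureml.core import Environment
    from azureml.core.model import InferenceConfig

    # Retrieve the exact version that was registered earlier.
    env = Environment.get(workspace=ws, name=env_name, version=env_version)

    # Pass the retrieved environment straight to the inference configuration.
    return InferenceConfig(entry_script=entry_script, environment=env)
```

With a workspace object ws from Workspace.from_config(), the result of build_inference_config(ws, "model_scoring_env", "1") would then be passed as inference_config to Model.deploy. The version "1" here is an assumption for a freshly registered environment.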