python pandas conda nan azure-machine-learning-service

How to track the version of a Python package used by Azure Machine Learning?

My team has deployed a Python script onto Azure Machine Learning (AML): among other things, this script processes data stored in a pandas dataframe. Lately, the script suddenly stopped working, returning an error related to the pd.NA values used to denote missing data.

Replacing the pd.NA values in the pandas dataframe with np.nan fixed this issue, but it is still unclear how this error happened in the first place.

According to pandas' webpage concerning missing data:

Experimental: the behaviour of pd.NA can still change without warning.

But our dependencies have not changed lately; for instance, we have been using pandas 1.1 since the beginning of our project. Our conda.yml file also stayed unchanged over the past few months.

Is it possible that Azure Machine Learning uses versions of toolboxes and packages different from those specified in the conda.yml file? If so, how can we keep track of the versions in use?

Solution

AzureML never injects packages into or updates a user environment.

What you are experiencing is most likely a result of bad practices in managing environments. For reproducibility, please pin all the packages you depend on to specific compatible versions. That won't completely eliminate the issue, but it significantly decreases the chance of breaking changes being introduced by your dependencies, including nested (transitive) ones.
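To see which versions are actually installed at run time, and to turn them into exact pins, something like the following could be called at the start of the deployed script. This is only a minimal sketch: the function names are hypothetical, the package list is an example, and it assumes Python 3.8+ so that `importlib.metadata` is available.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {name: version} for the given distribution names.

    Packages that are not installed map to None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def as_pinned_requirements(versions):
    """Format installed versions as exact pip-style pins, skipping missing packages."""
    return [
        f"{name}=={version}"
        for name, version in sorted(versions.items())
        if version is not None
    ]


if __name__ == "__main__":
    # Example package list (an assumption); replace with your own dependencies.
    found = installed_versions(["pandas", "numpy"])
    for line in as_pinned_requirements(found):
        print(line)
```

Writing these lines to the run's log gives a record of what was really installed for each run, which can then be compared against conda.yml.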

You can eliminate reproducibility issues with AzureML by using an immutable reference to your base image. If you are using an AzureML base image, date-based tags are immutable. In that case, the derived image won't be invalidated by a base image update, which would otherwise trigger a rebuild that might bring in breaking changes from your dependencies. The derived image will stay cached in the Azure Container Registry instance associated with your workspace until you rebuild it or manually delete it.
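As a rough illustration, a small check like the one below could flag base image references that are not pinned. It is a sketch under assumptions: the helper names are hypothetical, the date-based tag pattern (eight digits with an optional `.vN` suffix) is my guess at the tag shape, and the image names in the usage lines are made-up examples. It also treats a digest reference (`@sha256:...`) as immutable, since a digest identifies the image content itself.

```python
import re

# Assumed shape of a date-based tag, e.g. "20210301.v1" (illustrative value only).
DATE_TAG = re.compile(r"^\d{8}(\.v\d+)?$")


def split_image_ref(ref):
    """Split 'repo[:tag][@digest]' into (repo, tag, digest); missing parts are None."""
    repo, _, digest = ref.partition("@")
    # Only look for a tag in the last path segment, so a registry port
    # such as "localhost:5000" is not mistaken for a tag.
    last_segment = repo.rsplit("/", 1)[-1]
    tag = None
    if ":" in last_segment:
        repo, _, tag = repo.rpartition(":")
    return repo, tag, digest or None


def is_immutable_ref(ref):
    """Return True if the reference uses a digest or a date-based tag."""
    _, tag, digest = split_image_ref(ref)
    if digest:
        return True
    return tag is not None and DATE_TAG.match(tag) is not None


if __name__ == "__main__":
    # Hypothetical image references for illustration.
    for ref in ["example.azurecr.io/base:20210301.v1", "example.azurecr.io/base:latest"]:
        print(ref, is_immutable_ref(ref))
```

A missing tag or a floating tag such as `latest` is reported as not immutable, because the image behind it can change between builds.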