Tags: python, joblib, azure-machine-learning-service

joblib.dump() fails when saving model to temporary data store in AMLS


I am training a model using AMLS. I have a training pipeline in which step 1 trains a model and then saves the output to the temporary datastore model_folder using

import os
import joblib

os.makedirs(output_folder, exist_ok=True)
output_path = output_folder + "/model.pkl"
joblib.dump(value=model, filename=output_path)

Step 2 loads the model and registers it. The model folder is defined in the pipeline as

model_folder = PipelineData("model_folder", datastore=ws.get_default_datastore())
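For reference, here is a minimal sketch of how such a PipelineData is typically wired through the two steps with the Azure ML SDK v1. The step names, the register script name, the compute target, and the source directory below are placeholders, not taken from my actual pipeline:

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
model_folder = PipelineData("model_folder", datastore=ws.get_default_datastore())

# Step 1 writes the trained model into model_folder ...
train_step = PythonScriptStep(
    name="train",
    script_name="train_word2vec.py",
    arguments=["--output_folder", model_folder],
    outputs=[model_folder],
    compute_target="cpu-cluster",      # placeholder compute target
    source_directory="scripts",        # placeholder source directory
)

# ... and step 2 receives the same folder as an input and registers the model.
register_step = PythonScriptStep(
    name="register",
    script_name="register_model.py",   # placeholder script name
    arguments=["--model_folder", model_folder],
    inputs=[model_folder],
    compute_target="cpu-cluster",
    source_directory="scripts",
)

pipeline = Pipeline(workspace=ws, steps=[train_step, register_step])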

However, step 1 fails with the following ServiceError when the run tries to upload the saved model from the output folder:

Failed to upload outputs due to Exception: Microsoft.RelInfra.Common.Exceptions.OperationFailedException: Cannot upload output xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. ---> Microsoft.WindowsAzure.Storage.StorageException: This request is not authorized to perform this operation using this permission.

How can I solve this? Earlier in my code I had no problem interacting with the default datastore using

default_ds = ws.get_default_datastore()
default_ds.upload_files(...)

My 70_driver_log.txt is as follows:

[2020-08-25T04:03:27.315114] Entering context manager injector.
[context_manager_injector.py] Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['train_word2vec.py', '--output_folder', '/mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder', '--model_type', 'WO', '--training_field', 'task_title', '--regex', '1', '--stopword_removal', '1', '--tokenize_basic', '0', '--remove_punctuation', '0', '--autocorrect', '0', '--lemmatization', '1', '--word_vector_length', '152', '--model_learning_rate', '0.025', '--model_min_count', '0', '--model_window', '7', '--num_epochs', '10'])
Starting the daemon thread to refresh tokens in background for process with pid = 113
Entering Run History Context Manager.
Current directory:  /mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx
Preparing to call script [ train_word2vec.py ] with arguments: ['--output_folder', '/mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder', '--model_type', 'WO', '--training_field', 'task_title', '--regex', '1', '--stopword_removal', '1', '--tokenize_basic', '0', '--remove_punctuation', '0', '--autocorrect', '0', '--lemmatization', '1', '--word_vector_length', '152', '--model_learning_rate', '0.025', '--model_min_count', '0', '--model_window', '7', '--num_epochs', '10']
After variable expansion, calling script [ train_word2vec.py ] with arguments: ['--output_folder', '/mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder', '--model_type', 'WO', '--training_field', 'task_title', '--regex', '1', '--stopword_removal', '1', '--tokenize_basic', '0', '--remove_punctuation', '0', '--autocorrect', '0', '--lemmatization', '1', '--word_vector_length', '152', '--model_learning_rate', '0.025', '--model_min_count', '0', '--model_window', '7', '--num_epochs', '10']

Script type = None
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
OUTPUT FOLDER: /mnt/batch/tasks/shared/LS_root/jobs/aiworkspace/azureml/xxxxx/mounts/workspaceblobstore/azureml/xxxxx/model_folder
Loading SQL data...
Loading abbreviation data...
/azureml-envs/azureml_xxxxx/lib/python3.6/site-packages/pandas/core/indexing.py:1783: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value
Pre-processing data...
Succesfully pre-processed the the text data
Training Word2Vec model...
Saving the model...
Starting the daemon thread to refresh tokens in background for process with pid = 113


The experiment completed successfully. Finalizing run...
[2020-08-25T04:03:52.293994] TimeoutHandler __init__
[2020-08-25T04:03:52.294149] TimeoutHandler __enter__
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.44109439849853516 seconds
[2020-08-25T04:03:52.818991] TimeoutHandler __exit__
2020/08/25 04:04:00 logger.go:293: Process Exiting with Code:  0

My argparse arguments include

parser.add_argument('--output_folder', type=str, dest='output_folder', default="output_folder", help='output folder')
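Putting the pieces together, the save path in step 1 comes straight from this argument. A self-contained sketch of that flow (the placeholder dict stands in for the trained Word2Vec model, which is not shown in the question):

import argparse
import os

import joblib

parser = argparse.ArgumentParser()
parser.add_argument('--output_folder', type=str, dest='output_folder',
                    default="output_folder", help='output folder')
args = parser.parse_args()

# In the pipeline run, args.output_folder is the mounted workspaceblobstore
# path shown in the driver log ("OUTPUT FOLDER: /mnt/batch/...").
model = {"vectors": [0.1, 0.2, 0.3]}   # placeholder for the trained Word2Vec model

os.makedirs(args.output_folder, exist_ok=True)
joblib.dump(value=model, filename=os.path.join(args.output_folder, "model.pkl"))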

Solution

  • Fixed this by assigning my AMLS workspace the 'Storage Blob Data Contributor' role on the AMLS default storage account. It seems this role is usually granted by default, but that didn't happen in my case.
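  • For anyone hitting the same error: the role can be granted in the portal under the storage account's Access control (IAM) blade, or programmatically. Below is a rough sketch using the azure-mgmt-authorization package (assumed installed); the subscription, resource group, storage account name, and the object ID of the identity running the pipeline are all placeholders you would have to fill in. Which identity actually needs the grant (the workspace's system-assigned identity, the compute cluster's identity, or your own user) depends on how the workspace and compute are set up.

import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
storage_scope = (
    f"/subscriptions/{subscription_id}"
    "/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<default-storage-account>"
)

# Built-in role ID for "Storage Blob Data Contributor"; double-check it with
# `az role definition list --name "Storage Blob Data Contributor"`.
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope=storage_scope,
    role_assignment_name=str(uuid.uuid4()),   # new role assignments need a fresh GUID name
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<object-id-of-the-workspace-or-compute-identity>",
    ),
)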