
Issues accessing a FileDataset created from HTTP URIs in a PythonScriptStep


I’m having some issues trying to access a FileDataset created from two http URIs in an Azure ML Pipeline PythonScriptStep.

In the step, I only get a single entry named ['https%3A'] when calling os.listdir() on my mount point. I would have expected two files with their actual names instead. This happens both when sending the dataset as_download and as_mount. It even happens when I send the dataset reference to the pipeline step and mount it directly from the step.

The dataset is registered in a notebook, the same notebook that creates and invokes the pipeline, as seen below:

tempFileData = Dataset.File.from_files(
        ['https://vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg',
        'https://vladiliescu.net/images/reverse-engineering-automated-ml.jpg'])
tempFileData.register(ws, name='FileData', create_new_version=True)

#...

read_datasets_step = PythonScriptStep(
    name='The Dataset Reader',
    script_name='read-datasets.py',
    inputs=[
        fileData.as_named_input('Files'),
        fileData.as_named_input('Files_mount').as_mount(),
        fileData.as_named_input('Files_download').as_download(),
    ],
    compute_target=compute_target,
    source_directory='./dataset-reader',
    allow_reuse=False,
)

The FileDataset seems to be registered properly. If I examine it within the notebook, I get the following result:

{
  "source": [
    "https://vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg",
    "https://vladiliescu.net/images/reverse-engineering-automated-ml.jpg"
  ],
  "definition": [
    "GetFiles"
  ],
  "registration": {
    "id": "...",
    "name": "FileData",
    "version": 4,
    "workspace": "Workspace.create(...)"
  }
}

For reference, the machine running the notebook is using AML SDK v1.24, whereas the node running the pipeline steps is running v1.25.

Has anybody encountered anything like this? Is there a way to make it work?

Note that I'm specifically looking at file datasets created from web URIs, and I'm not necessarily interested in getting a FileDataset to work with blob storage or similar.


Solution

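  • The files should have been mounted at the paths "https%3A/vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg" and "https%3A/vladiliescu.net/images/reverse-engineering-automated-ml.jpg", relative to the mount point. The single "https%3A" entry returned by os.listdir() is the top-level directory of that tree ("%3A" is the URL-encoded ":").

    We retain the directory structure following the URL structure to avoid potential conflicts.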
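
    Because the mount mirrors the URL layout, a non-recursive os.listdir() only shows the top-level directory. Below is a minimal sketch of walking the tree instead. It uses a temporary directory as a stand-in for the real mount point so it runs without Azure ML; the helper name list_files_recursively is hypothetical. In an actual step the path would presumably come from Run.get_context().input_datasets['Files_mount'], which is an assumption about the v1 SDK and was not checked against the versions mentioned in the question.

```python
import os
import tempfile


def list_files_recursively(root):
    """Return the sorted paths (relative to root) of every file under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)


# Stand-in for the real mount point (assumption): recreate the layout
# described above in a temporary directory.
mount_point = tempfile.mkdtemp()
images_dir = os.path.join(mount_point, 'https%3A', 'vladiliescu.net', 'images')
os.makedirs(images_dir)
for file_name in ('deploying-models-with-azure-ml-pipelines.jpg',
                  'reverse-engineering-automated-ml.jpg'):
    with open(os.path.join(images_dir, file_name), 'wb'):
        pass  # empty placeholder file

# A flat listing only sees the top-level directory.
print(os.listdir(mount_point))

# Walking the tree finds both files, nested under the URL-derived folders.
files = list_files_recursively(mount_point)
for relative_path in files:
    print(relative_path)
```

    In the step script, mount_point would be replaced by the actual mount or download path; the walk itself does not depend on how the dataset was delivered.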