I'm having some issues accessing a FileDataset created from two HTTP URIs in an Azure ML Pipeline PythonScriptStep.
In the step, I only get a single entry named ['https%3A'] when calling os.listdir() on my mount point. I expected two files, with their actual names. This happens both when passing the dataset with as_download and with as_mount. It even happens when I pass the dataset reference to the pipeline step and mount it directly from the step.
The dataset is registered in a notebook, the same notebook that creates and invokes the pipeline, as seen below:
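from azureml.core import Dataset
from azureml.pipeline.steps import PythonScriptStep

# Create a FileDataset from two web URIs and register it in the workspace.
tempFileData = Dataset.File.from_files(
    ['https://vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg',
     'https://vladiliescu.net/images/reverse-engineering-automated-ml.jpg'])
tempFileData.register(ws, name='FileData', create_new_version=True)
#...
# Pass the dataset to the step three ways: as a reference, mounted, and downloaded.
# fileData is presumably defined in the elided code above.
read_datasets_step = PythonScriptStep(
    name='The Dataset Reader',
    script_name='read-datasets.py',
    inputs=[
        fileData.as_named_input('Files'),
        fileData.as_named_input('Files_mount').as_mount(),
        fileData.as_named_input('Files_download').as_download(),
    ],
    compute_target=compute_target,
    source_directory='./dataset-reader',
    allow_reuse=False,
)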
The FileDataset seems to be registered properly. If I examine it within the notebook, I get the following result:
{
  "source": [
    "https://vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg",
    "https://vladiliescu.net/images/reverse-engineering-automated-ml.jpg"
  ],
  "definition": [
    "GetFiles"
  ],
  "registration": {
    "id": "...",
    "name": "FileData",
    "version": 4,
    "workspace": "Workspace.create(...)"
  }
}
For reference, the machine running the notebook is using AML SDK v1.24, whereas the node running the pipeline steps is running v1.25.
Has anybody encountered anything like this? Is there a way to make it work?
Note that I'm specifically looking at file datasets created from web URIs, and I'm not necessarily interested in getting a FileDataset to work with blob storage or similar.
The files should have been mounted at the paths "https%3A/vladiliescu.net/images/deploying-models-with-azure-ml-pipelines.jpg" and "https%3A/vladiliescu.net/images/reverse-engineering-automated-ml.jpg", relative to the mount point. That is why os.listdir() on the mount point shows only the top-level "https%3A" directory.
We retain the directory structure following the URL structure to avoid potential conflicts.
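Since the files sit several directories below the mount point, a recursive walk finds them where a plain os.listdir() does not. Below is a minimal sketch of that idea. The helper name list_mounted_files is hypothetical, and how the script obtains the mount point is left to the caller; the sketch only assumes it is an ordinary directory path.

```python
import os


def list_mounted_files(mount_point):
    """Return every file under mount_point as a sorted list of relative paths.

    Hypothetical helper: os.listdir() only shows the top-level entry
    (e.g. 'https%3A'), so we walk the whole tree instead.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # Keep paths relative to the mount point so they mirror the URL layout.
            found.append(os.path.relpath(full_path, mount_point))
    return sorted(found)
```

Calling list_mounted_files(mount_point) inside the step should then return entries that mirror the URL structure, such as a path beginning with "https%3A" followed by the host name and the image path.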