azure-machine-learning-service

AzureML create dataset from datastore with multiple files - path not valid


I am trying to create a dataset in Azure ML whose data source is multiple files (e.g. images) in a Blob Storage. What is the correct way to do that?

Here is the error I get when following the documented approach in the UI.

When I create the dataset in the UI, select the blob storage, and specify the directory as either dirname or dirname/**, the files cannot be found in the Explore tab; it fails with ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet from the Consume tab, I get the following error:

from azureml.core import Workspace, Dataset

# set variables (placeholders, redacted)
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
workspace_name = "<workspace-name>"

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
  StreamAccessException was caused by NotFoundException.
    Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'

When I select just one of the files instead of dirname or dirname/**, everything works. Does AzureML actually support datasets consisting of multiple files?
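
For comparison, this is the single-file case expressed in code (a minimal sketch mirroring the UI selection that does work; datastore and file names are from my setup below):

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore(ws, "testdatastore")

# a single concrete file resolves and downloads without errors
single_file_ds = Dataset.File.from_files(path=[(datastore, 'testdata/testfile1.txt')])
single_file_ds.download(target_path='.', overwrite=False)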

Here is my setup:

I have a storage account with one container, data. Inside it is a directory testdata containing testfile1.txt and testfile2.txt.

In AzureML I created a datastore testdatastore that points to the data container of that storage account.
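
Registering the datastore from code instead of the UI would look roughly like this (a sketch; the account key is a placeholder, and mystoragename is the storage account from the error message above):

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# register the blob container "data" as the datastore "testdatastore"
Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="testdatastore",
    container_name="data",
    account_name="mystoragename",
    account_key="<account-key>",
)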

Then in Azure ML I create a dataset from the datastore: I choose File as the dataset type, select the datastore above, browse to the testdata folder, and check the option to include files in subdirectories. This creates the path testdata/**, which does not work as described above.
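
As far as I can tell, that UI selection corresponds to this glob path in code (a sketch of what I believe the UI generates):

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore(ws, "testdatastore")

# the trailing ** is what the UI generates for "include files in subdirectories"
glob_ds = Dataset.File.from_files(path=[(datastore, 'testdata/**')])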

I get the same issue when creating the datastore and dataset in Python:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

datastore = Datastore(ws, "mydatastore")

# create a file dataset from the testdata directory and register it
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")
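
To inspect which files the dataset actually resolves, to_path() can be used (a minimal sketch; it materializes the file list, so a broken path should fail here as well):

# should list ['/testfile1.txt', '/testfile2.txt'] once the path is valid
print(test_ds.to_path())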

Solution

  • I uploaded and registered the files with the script below, and everything works as expected (a quick consumption check follows the script).

    from azureml.core import Datastore, Dataset, Workspace
    
    import logging
    
    logger = logging.getLogger(__name__)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    
    datastore_name = "mydatastore"
    dataset_path_on_disk = "./data/images_greyscale"
    dataset_path_in_datastore = "images_greyscale"
    
    azure_dataset_name = "images_grayscale"
    azure_dataset_description = "dataset transformed into the coco format and into grayscale images"
    
    
    workspace = Workspace.from_config()
    datastore = Datastore.get(workspace, datastore_name=datastore_name)
    
    logger.info("Uploading data...")
    datastore.upload(
        src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
    )
    logger.info("Uploading data done.")
    
    logger.info("Registering dataset...")
    datastore_path = [(datastore, dataset_path_in_datastore)]
    dataset = Dataset.File.from_files(path=datastore_path)
    dataset.register(
        workspace=workspace,
        name=azure_dataset_name,
        description=azure_dataset_description,
        create_new_version=True,
    )
    logger.info("Registering dataset done.")