Tags: python, azure-machine-learning-service

How to only load one portion of an AzureML tabular dataset (linked to Azure Blob Storage)


I have a Dataset defined in my AzureML workspace that is linked to a 1.6 GB CSV file in Azure Blob Storage. This file contains time-series data for around 10,000 devices. I could also have created 10,000 smaller files instead (since I use ADF for the transmission pipeline).

My question now is: is it possible to load only a part of the AzureML Dataset in my Python notebook or script, instead of loading the entire file?
The only code I have now loads the full file:

from azureml.core import Dataset

dataset = Dataset.get_by_name(workspace, name='devicetelemetry')
df = dataset.to_pandas_dataframe()

The only concept of partitioning I found for AzureML datasets was around time series, i.e. partitioning on timestamps and dates. Here, however, I would like to partition per device, so I can easily load all telemetry for a specific device.

Any pointers to docs or any suggestions? (I couldn't find any so far)

Thanks in advance


Solution

  • You're right that there are .time_*() filtering methods available on a TabularDataset.
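
    For reference, a minimal sketch of that time-based filtering, assuming the telemetry has a timestamp column (the column name 'timestamp' and the date range below are just placeholders):

    from datetime import datetime
    from azureml.core import Dataset

    dataset = Dataset.get_by_name(workspace, name='devicetelemetry')
    # declare which column carries the row timestamp
    dataset = dataset.with_timestamp_columns('timestamp')
    # keep only rows that fall inside the window, then materialize
    march = dataset.time_between(datetime(2020, 3, 1), datetime(2020, 3, 31))
    df = march.to_pandas_dataframe()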

    I'm not aware of any way to do the filtering you suggest (though I agree it would be a useful feature). To get per-device partitioning, my recommendation would be to structure your container like so:

    - device1
        - 2020
            - 2020-03-31.csv
            - 2020-04-01.csv
    - device2
        - 2020
            - 2020-03-31.csv
            - 2020-04-01.csv
    

    In this way you can define an all-up Dataset, but also per-device Datasets, by passing the device's folder to the DataPath:

    from azureml.core import Dataset
    from azureml.data.datapath import DataPath

    # all-up dataset: every device's files in the container
    ds_all = Dataset.Tabular.from_delimited_files(
        path=DataPath(datastore, '*')
    )
    # device 1 dataset: only the blobs under the device1/ folder
    ds_d1 = Dataset.Tabular.from_delimited_files(
        path=DataPath(datastore, 'device1/*')
    )
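
    With that layout, pulling the telemetry for a single device only reads that device's blobs (a usage sketch reusing ds_d1 from above; the registered dataset name is just an example):

    # materialize only device1's files into pandas
    df_d1 = ds_d1.to_pandas_dataframe()

    # optionally register it so it can be fetched by name later
    ds_d1.register(workspace, name='devicetelemetry-device1', create_new_version=True)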
    

    CAVEAT

    The dataprep SDK is optimized for blobs of around 200 MB. You can work with many small files, but it can sometimes be slower than expected, especially given the overhead of enumerating all the blobs in a container.