I have a Dataset defined in my AzureML workspace that is linked to an Azure Blob Storage CSV file of 1.6 GB. This file contains time-series information for around 10,000 devices. So, I could also have created 10,000 smaller files (since I use ADF for the transmission pipeline).
My question now is: is it possible to load a part of the AzureML DataSet in my python notebook or script instead of loading the entire file?
The only code I have now loads the full file:
from azureml.core import Dataset

dataset = Dataset.get_by_name(workspace, name='devicetelemetry')
df = dataset.to_pandas_dataframe()  # pulls the full 1.6 GB file into memory
The only concept of partitions I found with regards to the AzureML datasets was around time series and partitioning of timestamps & dates. However, here I would love to partition per device, so I can very easily just do a load of all telemetry of a specific device.
Any pointers to docs or any suggestions? (I couldn't find any so far)
Thanks in advance
You're right, there are the .time_*() filtering methods available with a TabularDataset.
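For completeness, a minimal sketch of those time filters, assuming the telemetry CSV has a timestamp column (the column name 'timestamp' here is an assumption):

from datetime import datetime
from azureml.core import Dataset, Workspace

workspace = Workspace.from_config()
dataset = Dataset.get_by_name(workspace, name='devicetelemetry')

# the .time_*() filters only work once a timestamp column has been declared;
# 'timestamp' is an assumed column name
dataset = dataset.with_timestamp_columns(timestamp='timestamp')

# keep only rows for a single day, then load just that slice
one_day = dataset.time_between(datetime(2020, 3, 31), datetime(2020, 4, 1))
df = one_day.to_pandas_dataframe()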
I'm not aware of any way to do the filtering you suggest (but I agree it would be a useful feature). To get per-device partitioning, my recommendation would be to structure your container like so:
- device1
  - 2020
    - 2020-03-31.csv
    - 2020-04-01.csv
- device2
  - 2020
    - 2020-03-31.csv
    - 2020-04-01.csv
In this way you can define an all-up Dataset, but also per-device Datasets, by passing the device's folder to the DataPath:
from azureml.core import Dataset
from azureml.data.datapath import DataPath

datastore = workspace.get_default_datastore()  # or whichever datastore holds the CSVs

# all-up dataset covering every device
ds_all = Dataset.Tabular.from_delimited_files(
    path=DataPath(datastore, '*')
)

# device 1 dataset
ds_d1 = Dataset.Tabular.from_delimited_files(
    path=DataPath(datastore, 'device1/*')
)
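You could then register the per-device Dataset and pull it down like any other (a sketch, reusing the workspace object from above; the dataset name is hypothetical):

# register the per-device dataset so it can be fetched by name later
ds_d1 = ds_d1.register(workspace=workspace,
                       name='devicetelemetry-device1',  # hypothetical name
                       create_new_version=True)

# load only that device's telemetry
df_d1 = ds_d1.to_pandas_dataframe()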
CAVEAT
The dataprep SDK is optimized for blobs of around 200 MB in size. So you can work with many small files, but it can sometimes be slower than expected, especially given the overhead of enumerating all the blobs in a container.