
How can I import data into Azure Machine Learning Studio from Azure Table storage with an OData query?


The documentation for importing from Azure Table with the Import Data module can be found here: https://msdn.microsoft.com/en-us/library/azure/mt674699

It states that:

The Import Data module does not support filtering as data is being read. The exception is reading from data feeds, which sometimes allow you to specify a filter condition as part of the feed URL.

Our table storage holds a large amount of data, and it is not feasible to re-download the entire data set each time we run the experiment. I'm aware of the option to cache the data, but new data is constantly being inserted and we would like to use it whenever the experiment is run.

Is there an alternative to the Import Data module that we could use to get table storage data with an OData query instead?


Solution

  • There is no generic way to incrementally update a dataset.

    However, depending on what you want to do with the data, there are different options for adding new data:

    The Add Rows module effectively concatenates two datasets. You could feed the old, cached dataset into the left-hand input and the new data into the right-hand input, so that only the new data has to be read. However, you would have to write your own logic for telling new rows from old ones, and maintain it outside Azure ML.

    You could create an OData feed on top of table storage to enable filtering, and fetch only the new data that way. Be aware that right now only public feeds are supported, and that you would still have to use Join or Add Rows to recombine the old and new data as described above.

    You might also look into using table names, partition keys, and row keys to chunk your data, so that each run only has to read the chunks that contain new rows.

    If you are retraining a model and you want to update your feature statistics, the Learning with Counts modules support incremental updates of count-based features.
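
The "which rows are new" logic mentioned above can be sketched roughly as follows. This is only an illustration, not something the Import Data module does for you: the helper names `build_incremental_filter` and `merge_rows` are hypothetical, and the sketch assumes you keep the time of the last successful import somewhere outside Azure ML and that whatever fetches the rows (an OData feed or your own script) accepts a table storage `$filter` expression. `Timestamp`, `PartitionKey`, and `RowKey` are the system properties every table storage entity has.

```python
from datetime import datetime, timezone


def _quote(value):
    # OData string literals escape a single quote by doubling it.
    return "'{}'".format(value.replace("'", "''"))


def build_incremental_filter(last_seen, partition_from=None, partition_to=None):
    """Build a $filter string for entities modified at or after last_seen.

    last_seen must be a timezone-aware datetime (hypothetical watermark
    stored outside Azure ML). The optional partition bounds narrow the
    query to a half-open PartitionKey range [partition_from, partition_to).
    """
    stamp = last_seen.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    clauses = ["Timestamp ge datetime'{}'".format(stamp)]
    if partition_from is not None:
        clauses.append("PartitionKey ge " + _quote(partition_from))
    if partition_to is not None:
        clauses.append("PartitionKey lt " + _quote(partition_to))
    return " and ".join(clauses)


def merge_rows(cached_rows, new_rows):
    """Combine cached and new rows, keeping the latest copy of each entity.

    Rows are dicts; (PartitionKey, RowKey) identifies an entity, so a row
    fetched again in the new batch replaces its cached version.
    """
    merged = {}
    for row in list(cached_rows) + list(new_rows):
        merged[(row["PartitionKey"], row["RowKey"])] = row
    return list(merged.values())


if __name__ == "__main__":
    watermark = datetime(2016, 5, 1, 12, 30, tzinfo=timezone.utc)
    print(build_incremental_filter(watermark, "2016-05", "2016-06"))
```

Because the watermark is cut to whole seconds and compared with `ge`, some rows may be fetched twice; deduplicating on the partition and row key afterwards absorbs that. The partition bounds are optional, but since table storage only indexes `PartitionKey` and `RowKey`, a filter on `Timestamp` alone is likely to scan far more data than one that is also restricted to a partition range.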