
Azure ML SDK DataReference - File Pattern - MANY files


I’m building out a pipeline that should execute and train fairly frequently. I’m following this: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-create-your-first-pipeline

I’ve got a Stream Analytics job dumping telemetry into .json files on blob storage (soon to be ADLS Gen2). I want to find all the .json files and train on all of them. I could possibly train on just the new .json files as well (an interesting option, honestly).

Currently I just have the store mounted to a data lake and available, and the script iterates over the mount to find the data files and load them.

  1. How can I use data references for this instead?
  2. What do data references do for me that mounting time-stamped data does not?
     a. From an audit perspective, I have version control, execution times, and time-stamped read-only data. Doing a replay on this would require additional coding, but it is doable.

Solution

  • You could pass a pointer to the folder as an input parameter for the pipeline, and then your step can mount the folder and iterate over the .json files in it.
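
A minimal sketch of what the step script could do once the folder is mounted and its path has been handed to the step as an argument. It uses only the Python standard library. The function names (`find_json_files`, `load_records`, `load_all`) and the `modified_after` filter are hypothetical, and it assumes each file holds either a JSON array or line-delimited JSON objects; adjust that to whatever format your Stream Analytics output actually uses.

```python
import json
from pathlib import Path


def find_json_files(root, modified_after=None):
    """Return every .json file under root, searching subfolders too.

    modified_after is an optional Unix timestamp (hypothetical filter)
    for picking up only files written since the last training run.
    """
    files = sorted(Path(root).rglob("*.json"))
    if modified_after is not None:
        files = [f for f in files if f.stat().st_mtime > modified_after]
    return files


def load_records(path):
    """Load one file, assuming a JSON array or line-delimited JSON."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    # Otherwise treat every non-empty line as one JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_all(root, modified_after=None):
    """Collect the records from every matching file under root."""
    records = []
    for path in find_json_files(root, modified_after):
        records.extend(load_records(path))
    return records


# In the step script, the mounted folder path would typically arrive as a
# command-line argument, for example: records = load_all(sys.argv[1])
```

Filtering on modification time is only one way to pick up "just the new files". Whether the timestamps survive the mount depends on the storage backend, so date-based folder names may be a safer basis for that filter.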