Tags: azure, azure-functions, azure-pipelines, azure-synapse, azure-data-lake-gen2

Triggered copy of data from blob storage to ADLS, extracting the path from the filename


I am trying to centralize our data in an ADLS Gen2 data lake. One of our datasets is 'dumped' into a blob storage account, and I want a triggered copy to the lake.

The files stored in the blob storage have a date as the filename (it can be an arbitrary date) and are in JSON format. What I want is for new files to be (binary) copied to a folder on the data lake, with the path built from pieces of the date present in the filename. For example:

2020-01-01.json  →  raw/blob/2020/01/raw_reports_blob_2020-01-01.json
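Roughly, the path logic I have in mind would be something like this in Python (just a sketch of the renaming; the raw_reports_blob_ prefix is from my example above):

    import os
    from datetime import datetime

    def destination_path(source_filename: str) -> str:
        # "2020-01-01.json" -> "raw/blob/2020/01/raw_reports_blob_2020-01-01.json"
        stem, ext = os.path.splitext(source_filename)   # "2020-01-01", ".json"
        date = datetime.strptime(stem, "%Y-%m-%d")      # fails loudly on unexpected names
        return f"raw/blob/{date:%Y}/{date:%m}/raw_reports_blob_{stem}{ext}"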

First I tried the Copy Data tool and a pipeline in Azure Synapse, but I am not sure how to set the sink path using details from the source filename. It also seems that the Copy Data tool cannot be triggered by new blob files. The pipeline method looks pretty powerful, and I guess it should be possible there. What I want is not that difficult on Linux, so I assume it must be possible in Azure as well.

Second, I tried to create an Azure Function, as I am pretty comfortable with Python. However, I ran into a similar problem there: I need to define in/out bindings, and the output bindings are fixed at design time, which does not give me the freedom to build the destination path from the source filename. It also feels somewhat like overkill for a simple binary copy action. I can have the function triggered by new files in the blob storage, and reading them is no problem.

I am relatively new to Azure and any help towards a solution is more than welcome.


Solution

  • See this answer as well: https://stackoverflow.com/a/66393471/496289


    There is no concept of "copy" per se in ADLS: you read/download from the source and write/upload to the target.

    As someone mentioned, Data Factory can do this.

    You can also use:

    • azcopy from a PowerShell Azure Function: azcopy cp "https://[srcaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"
    • A Python/Java/... Azure Function. You'll have to download the file (in chunks if it's big) and upload it (again in chunks if it's big); see the sketch after this list.
    • Databricks. This would be a similar misuse of a tool as using Azure Synapse Analytics to copy data between storage accounts.
    • Azure Logic Apps. See this and this. I have never used them, but I believe they require less code than an Azure Function and have some programming capabilities, which may help you build the destination path programmatically.
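
    For the Python Azure Function option, a minimal sketch could look like the following. It assumes a blob trigger bound to myblob in function.json, the azure-storage-file-datalake package, and placeholder names for the app setting holding the lake connection string and for the target filesystem; the destination path is built in code from the date in the filename, so no output binding is needed:

        import os
        from datetime import datetime

        import azure.functions as func
        from azure.storage.filedatalake import DataLakeServiceClient

        def main(myblob: func.InputStream):
            # myblob.name is "<container>/<blob name>", e.g. "reports/2020-01-01.json"
            filename = os.path.basename(myblob.name)      # "2020-01-01.json"
            stem, ext = os.path.splitext(filename)
            date = datetime.strptime(stem, "%Y-%m-%d")    # date taken from the filename

            # Treat "raw" as the ADLS Gen2 filesystem (container) and build the
            # rest of the path from the date; adjust to your lake layout.
            dest_path = f"blob/{date:%Y}/{date:%m}/raw_reports_blob_{stem}{ext}"

            # Connection string kept in app settings; "DATALAKE_CONNECTION" is an assumed name.
            service = DataLakeServiceClient.from_connection_string(
                os.environ["DATALAKE_CONNECTION"])
            file_client = service.get_file_system_client("raw").get_file_client(dest_path)

            # Binary copy: write the blob contents straight into the lake file.
            file_client.upload_data(myblob.read(), overwrite=True)

    The caveats below still apply: myblob.read() pulls the whole file into memory, so for very large files you would stream and upload in chunks instead.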

    Things to remember:

    • Data Factory can be expensive, especially compared to Azure Functions on the consumption plan.
    • Azure Functions on the consumption plan have a 10-minute maximum before they time out, so they can't be used if your files are in the GB/TB range.
    • You'll be paying egress costs, where applicable.