azure-storage · azure-blob-storage · azure-data-factory · azure-logic-apps · azure-eventgrid

BlobCreated event only when a blob is completely committed


I want to create an Event Grid subscription via ARM deployment to ingest into an ADX cluster once all blobs in the container have been added (a batch of blobs is created at regular time intervals).

I want to make sure that the event is raised only when a blob is created and its content has been fully written (not just a "blob created" notification for a file that is still being written to).

The storage account is Data Lake Storage Gen2, which means it has a hierarchical namespace.

I'm looking at this link:

https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage?tabs=event-grid-event-schema

I'm confused between the "List of events for Blob REST APIs" section and the "List of the events for Azure Data Lake Storage Gen 2 REST APIs" section.

Which one is relevant to me? Can I achieve my goal directly, or is there a way to work around it?


Solution

  • I have seen various ways to circumvent this, some good, some bad. Just depends on what works for you.

    1. You could wait for a specified amount of time when the Blob event trigger first fires. If you have small files that you know should never take more than 2 or 3 minutes to load, this is an option. If you don't know whether the blob will take 5 minutes or 30 minutes to fully load, it's a terrible option. I personally dislike this method, but it works for some.

    The Wait Method

    2. Stage the blobs using a stream or direct copy, then move them to the final container via a direct copy, which does not fire the trigger until the blob is fully committed. I personally liked this method because it let blobs be moved into staging, have metadata set on them, and then be copied over with the metadata included via the Blob Copy command (see the sketch after this list). (FYI, ADF can set metadata within the Copy activity, but the same principle applies to Azure Functions, PowerShell, C#, or whatever code is writing the blobs out there.)

    Stage Blob, perform action, then copy to final container destination

    3. You could theoretically use an Until activity in ADF to check the status of the incoming blob, and once it is no longer growing, has a particular status, or has a particular last-modified time (something along those lines), continue on in your pipeline. I have not tested this method, but I expect you could repeatedly get the file's metadata and use that to determine whether the pipeline is ready to continue (a rough sketch follows the next one below).
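
Here is a minimal sketch of the stage-then-copy pattern from option 2, using the Python azure-storage-blob SDK. The connection string, container names (staging, landing), and blob path are hypothetical placeholders; the same flow could equally be driven from ADF, PowerShell, or C#.

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# 1. Upload to a staging container that has NO Event Grid subscription,
#    so nothing fires while the content streams in.
staging = service.get_blob_client(container="staging", blob="batch-001/data.csv")
with open("data.csv", "rb") as f:
    staging.upload_blob(f, overwrite=True)

# 2. Optionally stamp metadata while the blob sits in staging.
staging.set_blob_metadata({"batch": "001", "status": "complete"})

# 3. Server-side copy into the monitored container. The destination blob
#    is committed as a whole, so BlobCreated fires only for the fully
#    written blob, and the metadata is carried across by the copy.
#    (Same-account copies like this one are authorized by the account key.)
final = service.get_blob_client(container="landing", blob="batch-001/data.csv")
final.start_copy_from_url(staging.url)
```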
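
And here is a rough Python equivalent of what the Until loop in option 3 would do: poll the blob's properties and proceed only once its size stops changing between checks. The 30-second interval and the names are assumptions to illustrate the idea, not tested guidance.

```python
import time
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    "<connection-string>",
    container_name="landing",
    blob_name="batch-001/data.csv",
)

previous_size = -1
while True:
    props = blob.get_blob_properties()
    if props.size == previous_size:
        break  # size stable across two polls: assume the write has finished
    previous_size = props.size
    time.sleep(30)  # poll interval; tune to the expected write cadence

print(f"Blob appears fully written ({props.size} bytes); continue the pipeline.")
```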