I have data that I plan to upload to an Azure storage account. My plan is to create a pipeline in Synapse Studio that includes an Apache Spark notebook (using PySpark). The primary objective is to have the notebook process the data and then save it to a lake database.
The data will be uploaded to an Azure storage container following a path format such as 2022/Week1/Week1.xlsx and 2023/Week10/Week10.xlsx. Initially, I will store and process all historical data in the storage account. After that, data will be processed and added to the lake database on a weekly basis. The question is: what is the most efficient way to have the Azure pipeline or the notebook identify and process only the newly added data?
Here's a sample pattern you can use to ensure you don't import the same file twice. It uses folders to track the state of each file.
Create these folders in ADLS: Waiting, Processing, Archive, and Failed.
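For instance, you can create them from the notebook itself with mssparkutils, which is available in Synapse Spark notebooks. The storage account, container, and ingest root below are placeholders; substitute your own:

```python
# Sketch: create the four state folders under a common ingest root.
# "data" and "yourstorageaccount" are placeholder names.
from notebookutils import mssparkutils

base = "abfss://data@yourstorageaccount.dfs.core.windows.net/ingest"

for state in ["Waiting", "Processing", "Archive", "Failed"]:
    mssparkutils.fs.mkdirs(f"{base}/{state}")
```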
Now when your pipeline executes, it needs to do the following:

1. Move every file from Waiting to Processing.
2. Process the files in Processing (e.g. in your notebook).
3. Move each successfully processed file to Archive, and each file that failed to Failed.
The way this works is that any files to be imported should be copied into the Waiting folder. After the pipeline executes, every file that was in Waiting should have been moved out, and therefore won't be processed again the next time the pipeline runs.
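Here is a minimal sketch of that loop, assuming the folder layout above and mssparkutils in a Synapse notebook; process_file is a hypothetical placeholder for your own transformation logic:

```python
# Sketch of the claim/process/archive loop. The base path and
# process_file are placeholders -- adapt them to your own setup.
from notebookutils import mssparkutils

base = "abfss://data@yourstorageaccount.dfs.core.windows.net/ingest"

def process_file(path: str) -> None:
    # Replace with your real work, e.g. read the Excel file with
    # Spark/pandas and write the result to the lake database.
    pass

# 1. Claim every waiting file by moving it into Processing.
claimed = []
for f in mssparkutils.fs.ls(f"{base}/Waiting"):
    dest = f"{base}/Processing/{f.name}"
    mssparkutils.fs.mv(f.path, dest, True)  # True: create parent dir if missing
    claimed.append(dest)

# 2. Process each claimed file, then move it to Archive on success
#    or to Failed if anything goes wrong.
for path in claimed:
    name = path.rsplit("/", 1)[-1]
    try:
        process_file(path)
        mssparkutils.fs.mv(path, f"{base}/Archive/{name}", True)
    except Exception:
        mssparkutils.fs.mv(path, f"{base}/Failed/{name}", True)
```

A nice property of this design is that a file left sitting in Processing after a run is itself the error signal; you don't need a separate bookkeeping table.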
After execution, files remaining in these folders have the following meanings:
Waiting - a new file arrived during the ingestion process and will be processed on the next run
Processing - something went wrong during the processing step; you need to investigate
Archive - the file was processed successfully and won't be processed again
Failed - the file was specifically marked as failed. You need to fix the issue and probably put the file back in Waiting for reprocessing (a single move, as shown in the sketch after this list).
These are just suggested folder names - use whatever you like
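Re-queueing a failed file is then just a move back into Waiting, e.g. (reusing the placeholder base path from the sketches above):

```python
# Hypothetical example: send one failed file back for reprocessing.
mssparkutils.fs.mv(f"{base}/Failed/Week10.xlsx", f"{base}/Waiting/Week10.xlsx", True)
```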