I have data that I plan to upload to an Azure storage account. My plan is to create a pipeline in Synapse Studio that includes an Apache Spark notebook (using PySpark). The primary objective is to have the notebook process the data and then save it to a lake database.
The data will be uploaded to an Azure storage container following a path format such as 2022/Week1/Week1.xlsx and 2023/Week10/Week10.xlsx. Initially, I will store and process all historical data in the storage account. After that, data will be processed and added to the lake database on a weekly basis. The question is: what is the most efficient way to have the Azure pipeline or the notebook identify and process only the newly added data?
Here's a sample pattern you can use to ensure you don't import the same file twice. It uses folders to track the state of each file.
Create these folders in ADLS: Waiting, Processing, Archive, and Failed.
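For instance, you can create them from the notebook itself with mssparkutils, which is available in Synapse Spark notebooks. The storage account, container, and ingest root below are placeholders; substitute your own:

```python
# Sketch: create the four state folders under a common ingest root.
# "data" and "yourstorageaccount" are placeholder names.
from notebookutils import mssparkutils

base = "abfss://data@yourstorageaccount.dfs.core.windows.net/ingest"

for state in ["Waiting", "Processing", "Archive", "Failed"]:
    mssparkutils.fs.mkdirs(f"{base}/{state}")
```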
Now when your pipeline executes, it needs to do the following:

1. Move every file from Waiting to Processing.
2. Process the files in Processing (e.g. in your notebook).
3. Move each successfully processed file to Archive, and each file that failed to Failed.
The way this works is that any files to be imported should be copied into the Waiting folder. After the pipeline executes, every file that was in Waiting should have been moved out, and therefore won't be processed again the next time the pipeline runs.
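Here is a minimal sketch of that loop, assuming the folder layout above and mssparkutils in a Synapse notebook; process_file is a hypothetical placeholder for your own transformation logic:

```python
# Sketch of the claim/process/archive loop. The base path and
# process_file are placeholders -- adapt them to your own setup.
from notebookutils import mssparkutils

base = "abfss://data@yourstorageaccount.dfs.core.windows.net/ingest"

def process_file(path: str) -> None:
    # Replace with your real work, e.g. read the Excel file with
    # Spark/pandas and write the result to the lake database.
    pass

# 1. Claim every waiting file by moving it into Processing.
claimed = []
for f in mssparkutils.fs.ls(f"{base}/Waiting"):
    dest = f"{base}/Processing/{f.name}"
    mssparkutils.fs.mv(f.path, dest, True)  # True: create parent dir if missing
    claimed.append(dest)

# 2. Process each claimed file, then move it to Archive on success
#    or to Failed if anything goes wrong.
for path in claimed:
    name = path.rsplit("/", 1)[-1]
    try:
        process_file(path)
        mssparkutils.fs.mv(path, f"{base}/Archive/{name}", True)
    except Exception:
        mssparkutils.fs.mv(path, f"{base}/Failed/{name}", True)
```

A nice property of this design is that a file left sitting in Processing after a run is itself the error signal; you don't need a separate bookkeeping table.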
After execution, files remaining in these folders have the following meanings:
Waiting - a new file arrived during the ingestion process and will be processed on the next run
Processing - something went wrong during the processing step; you need to investigate
Archive - the file was processed successfully and won't be processed again
Failed - the file was specifically marked as failed. You need to fix the issue and probably put the file back in Waiting for reprocessing (a single move, as shown in the sketch after this list).
These are just suggested folder names - use whatever you like
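Re-queueing a failed file is then just a move back into Waiting, e.g. (reusing the placeholder base path from the sketches above):

```python
# Hypothetical example: send one failed file back for reprocessing.
mssparkutils.fs.mv(f"{base}/Failed/Week10.xlsx", f"{base}/Waiting/Week10.xlsx", True)
```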