Tags: pyspark, azure-synapse

How can I get Synapse Notebook to only process newly added data?


I have data that I plan on uploading to an Azure storage account. My plan is to create a pipeline in Synapse Studio, which will include an Apache Spark notebook (using PySpark). The primary objective is to have the notebook process the data and then save it to a lake database.

The data will be uploaded to an Azure storage container following this format, for example: 2022/Week1/Week1.xlsx and 2023/Week10/Week10.xlsx. Initially, I will store and process all historical data in the storage account. After that, the data will be processed and added to the lake database on a weekly basis. Now, the question is: what is the most efficient method to enable the Azure pipeline or the notebook to identify and process only the newly added data?


Solution

  • Here's a sample pattern you could use to ensure you don't import the same file twice.

    This uses folders to track the state of each file.

    Create these folders (in ADLS):

    • Waiting
    • Processing
    • Archive
    • Failed
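As a minimal sketch of this setup step, here is the folder creation using the local filesystem as a stand-in for the ADLS container (the `./ingest` base path is an assumption for illustration; in a Synapse notebook you would create the same folders under your `abfss://` container path instead):

```python
from pathlib import Path

# Hypothetical local stand-in for the ADLS container root.
BASE = Path("./ingest")

# The four folders that encode each file's state.
STATES = ["Waiting", "Processing", "Archive", "Failed"]

for state in STATES:
    (BASE / state).mkdir(parents=True, exist_ok=True)
```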

    Now when your pipeline executes it needs to do the following:

    1. Copy all files from Waiting to Processing
    2. Process all files found in Processing
    3. If the file processed successfully, move it to Archive
    4. If the file did not process successfully, move it to Failed
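The four steps above can be sketched as a small Python run, again using a local `./ingest` folder as a hypothetical stand-in for the ADLS container (in the real pipeline these would be Copy Data / Delete activities or `mssparkutils.fs` calls, and `process_file` would be your actual notebook logic writing to the lake database):

```python
import shutil
from pathlib import Path

BASE = Path("./ingest")  # hypothetical stand-in for the ADLS container root
WAITING, PROCESSING = BASE / "Waiting", BASE / "Processing"
ARCHIVE, FAILED = BASE / "Archive", BASE / "Failed"
for d in (WAITING, PROCESSING, ARCHIVE, FAILED):
    d.mkdir(parents=True, exist_ok=True)

# Seed one demo file so the run below has something to move (illustration only).
(WAITING / "Week1.xlsx").write_bytes(b"demo contents")

def process_file(path: Path) -> None:
    """Placeholder for the real processing step, e.g. reading the
    xlsx and writing it to the lake database."""
    if path.stat().st_size == 0:
        raise ValueError(f"{path.name} is empty")

# 1. Move every file from Waiting into Processing.
for f in list(WAITING.iterdir()):
    shutil.move(str(f), str(PROCESSING / f.name))

# 2-4. Process each file; on success move it to Archive, on failure to Failed.
for f in list(PROCESSING.iterdir()):
    try:
        process_file(f)
        shutil.move(str(f), str(ARCHIVE / f.name))
    except Exception:
        shutil.move(str(f), str(FAILED / f.name))
```

Because step 1 drains Waiting before processing begins, a file that arrives mid-run simply waits for the next execution instead of being half-processed.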

    The way this works is that any file to be imported is copied into the Waiting folder.

    After the pipeline executes, every file in the Waiting folder should have been moved out, and therefore won't be processed again on the next run.

    After execution, files found in these folders have the following meanings:

    Waiting - A new file arrived during the ingestion process and will be processed next time

    Processing - Something went wrong with your processing step; you need to investigate.

    Archive - The file was processed and won't be processed again.

    Failed - The file was specifically noted as failed. You need to fix the issue and probably put the file back in Waiting for reprocessing.
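The reprocessing step for failed files is just another move. A minimal sketch, using the same hypothetical local `./ingest` stand-in for the ADLS container:

```python
import shutil
from pathlib import Path

BASE = Path("./ingest")  # hypothetical stand-in for the ADLS container root
FAILED, WAITING = BASE / "Failed", BASE / "Waiting"
for d in (FAILED, WAITING):
    d.mkdir(parents=True, exist_ok=True)

# Suppose Week10.xlsx failed on a previous run (illustration only).
(FAILED / "Week10.xlsx").write_bytes(b"fixed contents")

# Once the underlying problem is fixed, re-queue every failed file by
# moving it back into Waiting; the next pipeline run picks it up.
for f in list(FAILED.iterdir()):
    shutil.move(str(f), str(WAITING / f.name))
```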

    These are just suggested folder names - use whatever you like.