Search code examples
azure-data-lakeazure-data-factory

Azure Data Factory Incremental Load data by using Copy Activity


I would like to load incremental data from data lake into on premise SQL, so that i created data flow do the necessary data transformation and cleaning the data.

after that i copied all the final data sink to staging data lake to stored CSV format.

I am facing two kind of issues here.

  1. when ever i am trigger / debug to loading my dataset(data flow full activity ), the first time data loaded in CSV, if I load second time similar pipeline, the target data lake, the CSV file loaded empty data, which means, the column header loaded but i could not see the any value inside file.

  2. coming to copy activity, which is connected to on premise SQL server, i am trying to load the data but if we trigger this pipeline again and again, the duplicate data loaded, i want to load only incremental or if updated data comes from data lake CSV file. how do we handle this.

Kindly suggest.


Solution

  • When we want to incrementally load our data to a database table, we need to use the Upsert option in copy data tool.

    Upsert helps you to incrementally load the source data based on a key column (or columns). If the key column is already present in target table, it will update the rest of the column values, else it will insert the new key column with other values.

    Look at following demonstration to understand how upsert works. I used azure SQL database as an example.

    My initial table data:

    create table player(id int, gname varchar(20), team varchar(10))
    

    enter image description here


    My source csv data (data I want to incrementally load):

    • I have taken an id which already exists in target table (id=1) and another which is new (id=4).

    enter image description here


    My copy data sink configuration:

    • Create/select dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which upsert should happen.

    enter image description here


    Table after upsert using Copy data:

    • Now, after the upsert using copy data, the id=1 row should be updated and id=4 row should be inserted. The following is the final output achieved which is inline with expected output.

    enter image description here


    You can use the primary key in your target table (which is also present in your source csv) as the key column in Copy data sink configuration. Any other configuration (like source filter by last modified configuration) should not effect the process.